Homebrew LLMs and Open Source Models
November 17, 2024 12:24 PM
With a decent local GPU and some free open source software like ollama and open-webui, you can try "open source" LLM models like Meta's llama, Mistral AI's mistral, or Alibaba's qwen entirely offline.
The current state of AI can be separated into data, training, and inference. Ignoring for the moment clever advancements to reduce costs: the data required numbers in the millions to billions of samples; the training requires massive memory and processing investments to turn initial model architectures into trained models effective for some purpose; and the result of that training is generally a set of "weights". These often get called simply "models" when they are distributed as something portable that can execute on a wider selection of hardware.
While big players guard their flagship models and weights, and have the capacity to train and host them and charge for their use, others have for various (rarely altruistic) reasons trained and released models to the public for free (calling this "open source" is imprecise, since you can't realistically build and train these models from scratch without the source data and infrastructure).
There are efforts toward truly open-source training such as INTELLECT-1, and other projects aim to reduce the hardware requirements and platform complexity drastically through various trade-offs among ease of use, flexibility, and speed.
Ollama is a local service and command line tool that makes obtaining and interacting with these models as simple as "ollama run llama3.2". It's fairly bare bones, in the sense that you can certainly download and try to run a 200b or 405b model and will simply be informed you're out of memory, or see even more esoteric errors. (Named to evoke, but unaffiliated with, Meta's llama efforts.)
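Beyond the CLI, ollama also exposes a local HTTP API. Here's a minimal sketch of calling it from Python with the requests library, assuming the default local endpoint on port 11434 and that you've already pulled llama3.2:

```python
import requests

# Ask the local ollama service for a single, non-streamed completion.
# Assumes ollama is running and `ollama pull llama3.2` has already been done.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",
        "prompt": "Explain what a quantized model is in one sentence.",
        "stream": False,  # return one JSON object instead of a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```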
So, you will likely quickly want to do useful things like compare answers among multiple models, take advantage of features like multimodal inputs, remember conversational history, etc. For that, open-webui can use a local ollama or any "OpenAI API"-compatible service hosted locally or elsewhere (including OpenAI itself).
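That OpenAI-compatible surface is also handy for scripting. A sketch, assuming ollama's built-in OpenAI-compatible endpoint under /v1 (the API key is a placeholder, since ollama doesn't check it but the client requires one):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local ollama endpoint instead of OpenAI.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

chat = client.chat.completions.create(
    model="llama3.2",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Say just 'Hello, world.'"},
    ],
)
print(chat.choices[0].message.content)
```

The same script works against any other OpenAI-compatible host by swapping the base_url and model name.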
If you're comfortable with basic docker container management, open-webui has an all-in-one image that bundles ollama and open-webui, can take advantage of your local GPU hardware, and is reasonably simple to maintain. (Make sure you've configured your GPU to be available to Docker - see the NVIDIA example.)
Heck, ollama even runs under termux on Android if you want to experience real pain, uh, I mean fun.
Tips
The main difference you'll notice among models and hardware setups is speed - presuming you can run the model at all. For reference, the qwen:32b model just barely fits on a single higher-end NVIDIA RTX 4090 with 24 GB of VRAM. Apple M1 and M2 users have reported varying success running locally on that hardware, but usually with much smaller models like llama3.2:3b, which explicitly targets more constrained environments. Running only on CPU is painfully slow, so if you're tinkering with prompt tuning or other workflows you may want to prototype on a hosted service until you have something you're confident in.
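A rough back-of-envelope for why a 32b model is about the ceiling on a 24 GB card: the footprint is roughly parameters times bytes per parameter, plus overhead for the KV cache and runtime. A sketch, where the 1.2 overhead factor is a loose assumption rather than a measured number:

```python
# Rough VRAM estimate: parameters * bytes per parameter, plus a fudge factor
# for KV cache, activations, and runtime overhead (the 1.2 is an assumption).
def estimate_vram_gb(params_billion: float, bits_per_param: float, overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * (bits_per_param / 8)
    return bytes_total * overhead / 1e9

# A 32b model at 4-bit quantization: ~19 GB, which just squeezes into 24 GB.
print(f"{estimate_vram_gb(32, 4):.1f} GB")   # ~19.2
# The same model at 16-bit precision would need ~77 GB - no chance on a single 4090.
print(f"{estimate_vram_gb(32, 16):.1f} GB")  # ~76.8
```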
The de facto industry leader for publishing and using models - open and licensed, research and production - is Hugging Face; think of them as the GitHub of models. They will generally let you try models online whenever available, and are piloting other strategies such as fully containerized models, "HUGS" (Hugging Face Generative AI Services), similar to NVIDIA's NIM (NVIDIA Inference Microservice) products. It's early days for the cloud-optional model ecosystem, where you can take your containerized model and host it on whatever hardware you have available or whichever cloud provider will rent you the GPUs. For an industry wary of eternal rentier economies mediated by a few large players, it's a crucial investment area.
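To try a Hugging Face model directly from Python, the transformers library will download the weights from the hub on first use and cache them locally. A minimal sketch; the model id here (a small Qwen instruct variant) is just an example, so substitute whatever small model you want to test, and note it needs torch installed as well:

```python
from transformers import pipeline

# Downloads and caches the weights from the Hugging Face hub on first run.
# The model id is only an example of a small model; swap in one you want to try.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

out = generator("Briefly, what is a local LLM?", max_new_tokens=40)
print(out[0]["generated_text"])
```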
Some background
Large players like OpenAI, Anthropic, and Google appear to dominate the world of large language models because they have invested in the astronomical cost of collecting and curating training data and the nontrivial tasks of turning that training data into forms that can pretrain multi-billion parameter models. That training takes genuinely massive hardware and processing investments, but the resulting models and weights are significantly cheaper to execute once they're "frozen".
Once you have a model and weights, you can perform inference (feed the model input tokens and generate output tokens, as in the familiar question-and-answer conversation style). These models are often called "foundational" because they reflect some basic ability to interpret language and can be "fine tuned" to do things like follow instructions and carry on conversations, as well as to embed specialized domains of knowledge to varying degrees of precision. They also tend to be "large" in the sense of the number of parameters they are built with - parameter count is roughly analogous to the "resolution" available to represent a volume of text as interconnected concepts.
So, say, a 405 billion parameter model can represent some given text as related concepts at a higher "resolution" than a 70 billion parameter model can. Many optimizations have been invented that allow that 70 billion parameter model to still operate at higher quality on specific domains or tasks than you'd expect if you imagined it as having "only 17% of the resolution" of the larger model.
This is somewhat analogous to the early days of image processing, when images were sometimes represented as an enormous number of (expensive to store but high precision) floating point numbers - uncompressed, but able to capture extremely small differences in luminosity or color. As we got better at identifying where human senses were limited or could be short-cut, the need to represent everything at the highest possible fidelity was slowly replaced with consensus on "good enough", along with ways to compress the data - sometimes with loss, sometimes without loss but more expensive to encode or decode.

By analogy, a 405 billion parameter model may contain large, sparsely populated areas that cost the same to represent as very dense areas, so analyzing the actual weights and shape of the larger model can offer opportunities to shrink it down to where it fits in the memory of, for instance, a single high-end GPU while still performing specific tasks nearly as well as the larger model. Other approaches include "quantizing", which takes high-precision values and forces them into lower-precision buckets that are cheaper to store and process, and "distilling", the general term for many tactics that use large models to fine-tune smaller models, which may themselves then be further quantized, and so on.
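To make the "lower-precision buckets" idea concrete, here's a toy sketch of symmetric 8-bit quantization with numpy - not how any particular runtime actually does it, just the basic round-trip of scaling floats into a small integer range and back:

```python
import numpy as np

# Toy symmetric quantization: map float32 weights into int8 buckets and back.
weights = np.random.randn(8).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale factor for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # these 1-byte values are what gets stored
dequantized = q.astype(np.float32) * scale     # what inference actually works with

print("original:   ", np.round(weights, 4))
print("dequantized:", np.round(dequantized, 4))
print("max error:  ", np.abs(weights - dequantized).max())  # small, bounded by scale/2
```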
Even on a Thinkpad T480s with just 8 GB of RAM and stuck in CPU mode, I seem to be able to get Ollama running just about any model up to about 3.5 GB (eg, anything up to what's labeled :3.8b or so). A few are decent enough at general text generation to be able to do the interactive-narrative thing; others, not so much...
[04:53:00] ~/xyzzy$ ollama run smollm:135m
>>> Please say just "Hello, world.".
Here is a Python function that converts a string to a number:
```python
def convert_to_number(s):
    try:
        return int(s)
    except ValueError:
        return None
```
>>> /bye
posted by DataPacRat at 1:54 PM on November 17 [1 favorite]
Ollama gets all the love among more technical folks, but LM Studio is a better (certainly friendlier) way for most people to get started with running local language models.
posted by ArmandoAkimbo at 1:58 PM on November 17 [2 favorites]
I'm a command line / Python kind of hacker so I've been using Simon Willison's llm tool.
posted by Nelson at 2:33 PM on November 17
Thank you so much for this excellent, excellent post, Lenie Clarke.
I have downloaded a few things and played with them a bit, with varying results. Having some clear background, and the opinions of the hive mind about model quality, is very helpful to me.
Thank you!
posted by kristi at 2:35 PM on November 17
thanks for this post! your analogy at the end about image compression makes sense, and is also a good reminder of this great Ted Chiang article, ChatGPT Is a Blurry JPEG of the Web. For folks trying to make sense of the LLM situation, i recommend reading Chiang's article as a rider to this post.
posted by dkg at 4:44 PM on November 17
That being said, day to day I still rely on a mixture of Copilot in my neovim IDE and Claude in my browser for bigger queries. The quality of the answers is still significantly better than what I can get out of these locally-run models. The security-minded part of me would really rather run everything locally, but Copilot and Claude are such a huge boon to my workflow that I'm not willing to give them up until the gap is further closed. If LLM tech feels more like "nice to have" than "mission critical" to you, though, local LLMs are definitely in the space of "good enough".
posted by Alex404 at 1:23 PM on November 17 [1 favorite]