Homebrew LLMs and Open Source Models
November 17, 2024 12:24 PM Subscribe

With a decent local GPU and some free open source software like ollama and open-webui you can try "open source" LLM models like Meta's llama, Mistral AI's mistral, or Alibaba's qwen entirely offline.

The current state of AI can be separated into data, training, and inference. Ignoring for the moment clever advancements to reduce costs, the data required must number in the millions to billions of samples, the training requires massive memory and processing investments to turn initial model architectures into trained models effective for some purpose, and the result of that training is generally a set of "weights". These often get called simply "models" when they are distributed as something portable that can execute on a wider selection of hardware.

While big players guard their flagship models and weights and have the capacity to train and host them and charge for their use, others have for various (rarely altruistic) reasons trained and released models for free to the public (calling this "open source" is imprecise as you can't realistically build and train these models from scratch without source data and infrastructure).

There are efforts toward truly open-source training such as INTELLECT-1 and other efforts like to reduce the hardware requirements and platform complexity drastically through various trade-offs among ease of use, flexibility, and speed.

Ollama is a local service and command line tool that makes obtaining and basic interaction with these models as simple as "ollama run llama-3.2". It's fairly bare bones, in the sense you can certainly download and try to run a 200b or 405b model and will simply be informed you're out of memory or even more escoteric errors. (Named to evoke but unaffiliated with Meta's llama efforts.)

So, you will likely quickly want to do useful things like compare answers among multiple models, take advantage of features like multimodal inputs, remember conversational history, etc. For that, open-webui can use a local ollama or any "OpenAI API' compatible service hosted locally or elsewhere (including OpenAI).

If you're comfortable with basic docker container management, open-webui has an all-in-one image that bundles ollama and the open-webui, can take advantage of your local GPU hardware, and is reasonably simple to maintain. (Make sure you've configured your GPU to be available to Docker, NVidia Example)

Heck, ollama even runs under termux on Android if you want to experiece real pain, uh, I mean fun.

Tips

The main difference you'll notice among models and hardware differences is speed - presuming you can run the model at all. For reference, the qwen:32b model just exactly runs on a single higher end NVIDIA RTX-4090 with 24G RAM. Apple M1 and M2 users have reported varying success running locally on that hardware, but usually with much smaller models like llama-3.2:3b which is explicitly targeting more constrained environments. Running only on CPU is painfully slow so if you're tinkering with prompt tuning or other workflows you may want to prototype on a hosted service until you have something you're confident in.

The de-facto industry leader for publishing and using models, open and licensed, research and production, is huggingface - think of them as the GitHub of models. They will generally let you try models online whenever available, and are piloting other strategies of fully containerized models "HUGS" (Hugging Face Generative AI Services) similar to Nvidia's NIM (NVidia Inference Microservice) products. It's early days for the cloud-optional model ecosystem where you can take your containerized model and host it on whatever hardware you have available or cloud provider who will rent you the GPUs. For an industry wary of eternal rentier economies mediated by a few large players, it's a crucial investment area.

Some background

Large players like OpenAI, Anthropic, and Google appear to dominate the world of large language models because they have invested in the astronomical cost of collecting and curating training data and the nontrivial tasks of turning that training data into forms that can pretrain multi-billion parameter models. That training takes genuinely massive hardware and processing investments, but the resulting models and weights are significantly cheaper to execute once they're "frozen".

Once you have a model and weights, you can perform inference (feed the model input tokens and generate output tokens, such as the familiar question-answer conversation style). These models are often called "foundational" because they reflect some basic ability to interpret language and can be "fine tuned" to do things like follow instructions and carry on conversations, also to embed specialized domains of knowledge to varying degrees of precision. They also tend to be "large" in the sense of the number of parameters they are built with - these can be analogous to the "resolution" available to represent a volume of text as interconnected concepts.

So say a 405 billion parameter model can represent some given text as related concepts in a higher "resolution" than 70 billion parameters can. Many optimizations have been invented that allow that 70 billion parameter model to still operate at relatively higher quality on specific domains or tasks than you'd expect if you just imagined it as "only 17% the resolution" of the larger model.

This is somewhat analogous to early image processing days when images were sometimes represented as an enormous number of (expensive to store but high precision) floating point numbers, uncompressed, but able to capture extremely small differences in luminosity or color. As we got better at identifying where human senses were limited or able to be short-cut, the need to represent in the highest possible fidelity was slowly replaced with consensus on "good enough" while also identifying ways to compress the data - sometimes with loss, sometimes without loss but more expensive to encode or decode. As an analogue, the resolution of 405 billion parameter model may contain large sparsely populated areas that cost the same to represent as very dense areas, so analyzing the actual weights and shape of the larger model can offer opportunities to shrink it down to where it can fit in memory of, for instance, a single high-end GPU but still perform specific tasks nearly as well as the larger model. Other approaches include "quantizing" which takes high-precision values and forces them into lower-precision buckets that are cheaper to store and process, and "distilling" which is the general term for many tactics to use large models to fine-tune smaller models which may themselves then be further quantized, and so on.

posted by Lenie Clarke (19 comments total) 94 users marked this as a favorite

This is an excellent write up on the state of this tech. All of it works quite well at this point and ollama is really nice. I have a 3090 and the answers I get out of the models with around 10B parameters are impressive.

That being said, day to day I still rely on a mixture of copilot in my neovim IDE, and Claude in my browser for bigger queries. The quality of the answers is still significantly better then what I can get out of these locally-run models. The security minded part of me would really rather run everything locally, but Copilot and Claude are such a huge boon to my workflow that I'm not willing to give them up until the gap is further closed. If LLM tech feels more like "nice to have" than "mission critical" though, local LLMs are definitely in the space of "good enough".
posted by Alex404 at 1:23 PM on November 17, 2024 [1 favorite]

Even on a Thinkpad T480s with just 8 GB of RAM and stuck in CPU mode, I seem to be able to get Ollama running just about any model up to about 3.5 GB (eg, anything up to what's labeled :3.8b or so). A few are decent enough at general text generation to be able to do the interactive-narrative thing; others, not so much...

[04:53:00] ~/xyzzy$ ollama run smollm:135m
>>> Please say just "Hello, world.".
Here is a Python function that converts a string to a number:

```python
def convert_to_number(s):
try:
return int(s)
except ValueError:
return None
```

>>> /bye
posted by DataPacRat at 1:54 PM on November 17, 2024 [5 favorites]

Ollama gets all the love among more technical folks, but LM Studio is a better (certainly friendlier) way for most people to get started with running local language models.
posted by ArmandoAkimbo at 1:58 PM on November 17, 2024 [6 favorites]

I'm a command line / Python kind of hacker so I've been using Simon Willison's llm tool.
posted by Nelson at 2:33 PM on November 17, 2024 [2 favorites]

Thank you so much for this excellent, excellent post, Lenie Clarke.

I have downloaded a few things and played with them a bit, with varying results. Having some clear background, and the opinions of the hive mind about model quality, is very helpful to me.

Thank you!
posted by kristi at 2:35 PM on November 17, 2024 [1 favorite]

thanks for this post! your analogy at the end about image compression formats makes sense, and is also a good reminder of this great Ted Chiang article ChatGPT is a Blurry Jpeg of the Web. For folks trying to make sense of the LLM situation, i recommend reading Chiang's article as a rider to this post.
posted by dkg at 4:44 PM on November 17, 2024 [5 favorites]

There's no shortage of options for different user interfaces from which to run the various LLM models.
Have previously mucked about with text-generation-webui and llama.cpp.
posted by Enturbulated at 7:21 PM on November 17, 2024

I've been using Notebook recently for a couple of ~50 year periodical project for personal research, but I've been trying to figure out the best approach for making a public LLM for a collection of 45 years worth of magazines from an org I'm involved with.
posted by drewbage1847 at 7:28 PM on November 17, 2024

But why?

Like, what should I do with these things? Ask them to please say just "Hello, world."? I know how to do that without all this superstructure.
posted by rum-soaked space hobo at 12:34 AM on November 18, 2024 [1 favorite]

[Thanks for this, Lenie Clarke! We've added it to the sidebar and Best Of blog!]
posted by taz (staff) at 3:11 AM on November 18, 2024 [3 favorites]

But why?

Like, what should I do with these things? Ask them to please say just "Hello, world."? I know how to do that without all this superstructure.

I think the "local" (i.e. "not connected to the FaceGoogSoft Plagiarism Machines") part of this is a bit disconnected from the larger question of "why bother doing LLMs at all". I might not be reading your question correctly, though.

The "local" part is more interesting. To me, anyway. Because I think it would be kinda nice to learn this new thing without having every single word I type hoovered up into FaceGoogSoft's training data. And since the better models require you to be logged in, probably associated with you. Forever. Which is ultimately the biggest value to the "local" models. At least as I see it.

I say that as a developer that's so far managed to avoid all this LLM garbage. But the new, vastly improved auto-complete in the paid version of IntelliJ is pretty damn nice. And more of that is coming, both nice and garbage.

It won't be avoidable forever. It will inevitably replace things you already know how to use. Some of it will be useful. But in any case if you want New Google to tell you how to e.g. tune a four string bass correctly that is probably going to come as much from humans knowing how to construct better prompts as model refinement.
posted by howbigisthistextfield at 3:55 AM on November 18, 2024 [6 favorites]

Wow Lenie, thanks for taking the time.

I have a non-GPU laptop, but I have a Google Colab account. Here is a way to run Ollama on Google's GPUs, then set up a secure tunnel from your local machine to the Google GPU Ollama instance.

What was taking ~24 hours to run on my local CPU is now running in 4-5 minutes on the Google GPU.
posted by superelastic at 4:24 AM on November 18, 2024 [4 favorites]

Another thanks, Lenie.

After getting docker on my personal laptop to play nice with its nvidia 2070 super, the bundled openwebui/ollama container appears to be working and doing LLM stuff. A local instance with everything not being scarfed up by FaceGoog is something I can see actually interacting with a bit more now.
posted by howbigisthistextfield at 6:51 AM on November 18, 2024 [2 favorites]

While looking into this I came across https://pinokio.computer/, an open source program that handles a lot of the installation details for AI applications. People who are interested in this might take a look at Pinokio. Also, the name makes me think the devs have an appropriately cautious attitude about LLMs and their limitations.
posted by Tehhund at 7:46 AM on November 18, 2024

Like, what should I do with these things? Ask them to please say just "Hello, world."? I know how to do that without all this superstructure.

I'm using hosted AIs all the time, every day. They are very useful tools for me. Three uses, not all well supported by running locally:

"Calculator for words". I regularly ask an LLM to improve my writing, or to suggest synonyms or alternate phrasing for things
Computer programming. Tuned LLMs are very good at doing grunt work and writing quick example code for APIs I've never used before.
Enhanced search. I regularly use Phind as a smart search engine with natural language queries. It gives useful synthesized answers. I don't think local LLMs can quite do this, at least not without more work.

I'm mostly OK with hosted LLMs. Where I'm excited about local ones is in being able to do bulk queries for free. I have a project to scrape thousands of textual descriptions of photos to extract street addresses. ChatGPT can kind of do this but I'm really limited by access and it's not quite good enough I'd want to pay. Very easy to do stuff in bulk with a local LLM.

Some folks like local LLMs because they can get ones tuned for their specific use case. There's a whole trade in "uncensored" LLMs. I suspect a lot of that is used for creepy porn conversations. But I'd love it just to work around copyright limitations. ChatGPT basically refuses to discuss poems with me, for instance, even poems long out of copyright. I appreciate how careful they're being but it'd be nice to have access to an LLM without guardrails sometimes.

One downside is I don't have the hardware to run a 70b model like ChatGPT does. Speaking of which... is there any easy way to estimate how much RAM a model needs? I know the basic standard is 16 bits per weight, so a 7B model is about 14GB of RAM. But LM studio is eager to let me download quantized models with 4 bits, 8 bits, etc and it gets hard to track.
posted by Nelson at 8:03 AM on November 18, 2024 [3 favorites]

Are there any guides for non-programmers to play with this, some kind of "Self-Hosted LLMs for Dummies"?
posted by star gentle uterus at 8:35 AM on November 18, 2024

It's not a guide exactly, but I found installing https://pinokio.computer/ to be pretty point-and-click, and their documentation is pretty good. I haven't used the LLMs mentioned in the FPP like llama or mistral, but I got some others working without doing anything especially technical.
posted by Tehhund at 9:27 AM on November 18, 2024 [3 favorites]

A timely article from the snarky but mostly quite readable and credible The Register covers the next logical step in this homebrew domain: fine tuning and beyond.

While I'm here, big thanks to Rvyar for inspiring this post and always approaching the topic with nuance and grace. I'm pleasantly surprised that while I was ruefully prepared for yet another MeFi AI hate-on, instead the response has been something closer to a little demystifying serving to separate the shittiness of walled garden big tech players from the useful applications of the technology.

But why?

Like, what should I do with these things? Ask them to please say just "Hello, world."? I know how to do that without all this superstructure.

This is a great question, I snipped off the part of this already long post on a recent personal project where I forced myself to really explore the edges of the technology and identify what factors (parameter count, foundational fine tuning for things like instruction following versus chat) had on quality.

My project was "I am hiring people and need to review a lot of resumes". I will bet you just jumped to the conclusion that I decided to outsource this job to the ultra smart LLM - this is a great example of what the hype created by those who wish to use the technology to exploit and perpetuate suffering led everyone to believe.

I'm very familiar with the impacts of unconscious and conscious bias, especially in doing things like reviewing a fairly low resolution representation of a candidate like a resume. In days gone by, I would consciously drag off screen or hide the part of the resume at the top with name and other information that might provide biasing signal when I did my initial scan and scoring on my hiring criteria - this is obviously imperfect.

Certainly you can do things with regular expressions and other kinds of processing that attempt to anonymize, summarize, or redact the data but those are generally operating from purely textual context. For instance, they can redact "boy" and "girl" from "Volunteered for the Boys & girls Club of Atlanta" under volunteering, but depending on their pre-training large language models might have enough context to understand the concept of the volunteer organization - so I thought maybe I could tag instances like that where proper names, pronouns, and potentially gendered or demographic or otherwise potentially biasing references exist in the resume then wrap them in a basic spoiler tag. So my initial parse of candidates' qualifications might reduce signal that could be biasing but still allow me to click or hover to find out specifics in cases where the signal value of information is higher than the risk of bias of observing it. (This also forced me to suffer the consequences of PDF to text parsing and how terrible your fancially formatted resumes look when reviewed in text at scale.)

I know there are plenty of companies who will sell this service, but none who I found will let you rate and understand their methodology, and given the profit motive, I am very doubtful they are being especially responsible and certainly not transparent, so I thought I would measure from both ends by doing it myself. (Also, those services would require me to navigate the requirements around sharing candidateinformation and so on, subject to the possibility that their licensing requires that my data is used to train their systems and on it goes.)

I already operate locally when reviewing resumes (and of course treat that data within security and privacy guidelines), and definitely do not want to somehow leak candidate information - so I wrapped the effort in some constrained containers and began combining prompt tuning with various models and manually reviewing the results. (Hence my offline explorations.)

They sucked.

But by combining several rounds of prompt tuning and the usual software engineering process of seeking out edge cases and filling in gaps in your primary processing with secondary pre-filtering and post filtering, I cobbled together something that works pretty well to inject a simple tag around potentially biasing terms then post-process that into HTML with spoiler tags that I can view locally. In the process, I found that some larger models had a decent understanding of the concept of institutions, schools, potential demographic indicators, and so on which could be used to score how biasing a given phrase or concept might be and in what category, which let me mark them up in different ways to give me visual indication of how biasing the information might be. This is also where I found injecting examples and concepts in the prompt had varying effects on quality (this is one form of retrieval augmentation).

Bear in mind, my main concern is quality and correctness in AI so I am also always considering how to validate what the machine scores and why, and in some cases the machine is able to provide a rationale. The rationale is often misguided garbage (" The candidate has the title senior researcher, which means they probably have experience in XY and Z") but even that can be guided within prompting and, presumably, with better fine tuning around concepts like "having a title does not necessarily mean you have all the skills related with that title".

So to answer your question, it's a tool that can in some cases go beyond simply manipulating text into bringing in some understanding of what some of the text may mean. Expecting a tool like that to "be factually accurate" or "not lie" is a bit like expecting a calculator to recognize when you are using the wrong unit to express the terms you're punching in. I still have to check its work but it can do a lot of work very quickly so I can focus on checking its conclusions and try to correct its faulty reasoning with guidance (and later with fine-tuning for this purpose).

So maybe you will have recipes you want to convert to common units or where you would like a summary of ingredients pulled from a wall of jabbering text. For larger online models, they have worked a great deal to make doing things like creating tables and lists based on concepts trained into the model, smaller offline models are not as good at this but with some help can do useful things like present data in one form in another form.

I'm a software person so anything I see you can do once automatically with decent results looks like something I could automate to do all the time as long as I have enough test cases to become confident that when it screws up I can figure it out and that it won't screw up that often such that it saves me time. Eg not just in one recipe, but maybe for every stupid recipe on the internet running locally in my browser to be turned into something quickly parsable for "do I have these ingredients and do I want to make this?" Does that kind of thing need to be offline? Couldn't I just call one of the big smart models to summarize the whole page for me? Sure, but as noted elsewhere, that means feeding walled gardens for free and maybe it's not recipes you want to summarize offline but private correspondence or your erotic fanfic or whatever else you don't want to feed to the beast.

Multimodal (not just text but imagery, sometimes video) offline models are also becoming more available to do some useful kinds of things like understanding imagery or converting an image into a text table, that kind of thing. It's still very homebrew early days so your mileage will definitely vary.

This whole project had a secondary motive to assess code generation and co-pilot technologies for how effective they were at building and, most importantly, refactoring and evolving applications that do things like process this resume text, interpret the results, score the quality of the results using the same or a different model and some guidelines, and so on. This is an area where the big hosted models are getting quite good but longer term projects with many changes can exhaust all the clever tricks they use to "remember" the whole project. (This also allowed me to use walled garden large models to help with the coding without having to share the sensitive data it processed.)

I have worked in AI a very long time and am thus very skeptical of treating it as anything approaching sentient, but I know a useful tool when I see one and I want to know the bounds and operating parameters of this one - especially when operating in an effectively offline environment without the profit motive.

I hope this helps.
posted by Lenie Clarke at 7:04 AM on November 19, 2024 [12 favorites]

Late in but really wanted to thank you for the thread, doubly so for also coming back with detailed and very interesting info on your use-case. And likewise wanted to echo your thanks to Ryvar, whose patient persistence in trying to cut through the noise has consistently served as a waymark and has played a large part in keeping me persevering through AI related posts on MF.
posted by protorp at 11:22 AM on November 19, 2024 [3 favorites]

« Older Martha Martha Martha | An American Peculiarity Newer »

This thread has been archived and is closed to new comments

MetaFilter

Homebrew LLMs and Open Source Models
November 17, 2024 12:24 PM Subscribe

Tags

Share

Homebrew LLMs and Open Source Models November 17, 2024 12:24 PM Subscribe

Tags

Share

Homebrew LLMs and Open Source Models
November 17, 2024 12:24 PM Subscribe