This is a good guide. Ollama and Llamafile are two of my top choices for this.
On macOS it's worth investigating the MLX ecosystem. The easiest way to do that right now is using LM Studio (free but proprietary), or you can run the MLX libraries directly in Python. I have a plugin for my LLM CLI tool that uses MLX here: https://simonwillison.net/2025/Feb/15/llm-mlx/
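If you go the Python route, mlx-lm is the easiest entry point; a rough sketch (the model name is just an example from the mlx-community collection on Hugging Face, and generate()'s arguments can shift between versions):

```python
# Minimal sketch: run a quantized model via the mlx-lm package on Apple Silicon.
# pip install mlx-lm
from mlx_lm import load, generate

# Example model from the mlx-community Hugging Face collection (assumption).
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

response = generate(
    model,
    tokenizer,
    prompt="Explain why unified memory helps local LLM inference.",
    max_tokens=200,
)
print(response)
```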
+1 for Llamafile. They just put out a new release that supports the DeepSeek R1 models. I keep a blank Llamafile and load models into it with a .bat file that I edit as needed, plus a custom web UI I made for saving and organizing my sessions.
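If you'd rather script the model swapping than edit a .bat by hand, the same idea in Python looks roughly like this (a sketch: the flags are the llama.cpp-style ones llamafile accepts, but check ./llamafile --help for your build, and the model path is made up):

```python
# Sketch: launch a bare llamafile binary against whichever GGUF you want.
import subprocess

MODEL = "models/DeepSeek-R1-Distill-Qwen-7B-Q4_K_M.gguf"  # example path

subprocess.run([
    "./llamafile",     # the "blank" llamafile
    "-m", MODEL,       # external model to load
    "--port", "8080",  # port for the built-in web server/UI
    "-ngl", "999",     # offload as many layers to the GPU as will fit
])
```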
Whisperfile is also amazing. I used it to transcribe a podcast I hosted 20 years ago; it was fast, and the errors it made were acceptable.
Agreed on Ollama (especially as a starting point, since it makes a lot of the initial setup super easy for anyone new). OTOH, you may quickly run into the limitations imposed by its defaults (for example, the low default context length -- to be fair, it can be adjusted).
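For what it's worth, once you're calling it programmatically the context bump is just a request option; a quick sketch with the ollama Python client (the model name is only an example):

```python
# Sketch: override Ollama's small default context window per request.
# pip install ollama, with an Ollama server running locally.
import ollama

reply = ollama.chat(
    model="llama3.1:8b",  # example model
    messages=[{"role": "user", "content": "Summarize this long document: ..."}],
    options={"num_ctx": 16384},  # the default context is much smaller
)
print(reply["message"]["content"])
```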
I found Mozilla's Transformer Lab quite nice to use, for very small and dabbling values of "use" at least. It encapsulates the setup and interaction with local LLMs into an app, and that feels more comfortable to me than using them from the CLI.
Upon getting a model up and running though, I quickly realized that I really have no idea what to use it for.
Is there a spreadsheet out there benchmarking local LLM and hardware configs? I want to know if I should even bother with my coffeelake xeon server or if it is something to consider for my next gaming rig.
It's really not hard to test with llamafile or ollama, especially with smaller 7B models. Just have a go.
There are a bazillion and one hardware combinations where even RAM timings can make a difference. Offloading a small portion of the model to a GPU can make a HUGE difference. Some engines have been optimized to run on Pascal with CUDA compute capability below 7.0, and some have tricks for newer-generation cards with modern CUDA. Some engines only run on Linux while others are completely cross-platform. It is truly the wild west of combinatorics as they relate to hardware and software. It is bewildering to say the least.
In other words, there is no clear "best" outside of a DGX and a Linux software stack. The only way to know anything right now is to test and optimize for what you want to accomplish by running a local LLM.
> Let’s be clear. It’s going to be a long time before running a local LLM will produce the type of results that you can get from querying ChatGPT or Claude. (You would need an insanely powerful homelab to produce that kind of results).
Has anyone experimented with local LLMs and compared the output to ChatGPT or Claude? The article mentions that they use local LLMs when they're not overly concerned with the quality or response time, but what are some other limitations or differences to running these models locally?
I've been using phi4 a lot, in most cases to process incoming transcripts of recordings. Based on keywords or keyphrases, these transcripts get sent along with a custom prompt, or the prompt is embedded in the transcript itself.
I'd say it's surprisingly good at extracting reminders and todos from natural language into JSON 'action objects', or at turning what is essentially a run-on sentence of a transcript into markdown-formatted text.
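The extraction step is roughly this shape, for anyone curious: a sketch against Ollama's REST API, with a made-up transcript, prompt, and field names:

```python
# Sketch: turn a transcript chunk into JSON "action objects" with phi4.
# format="json" asks Ollama to constrain the output to valid JSON.
import json
import requests

transcript = "remind me to email the plumber tomorrow and book the dentist"  # toy example

prompt = (
    "Extract every reminder or todo from the transcript below as a JSON object "
    'with an "actions" list of {"action", "due"} items.\n\nTranscript:\n' + transcript
)

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "phi4",
    "prompt": prompt,
    "format": "json",
    "stream": False,
}).json()

print(json.loads(r["response"]))
```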
What I've found most fun to play with is to get it to extract metadata like an 'anxiety score' and tags.
Overall, it's clearly 'dumber' than the hosted big models, and in my case I have to deal with a small context window. In general my 'vibe' is that I have to be clearer and more explicit, and it's usually better to just do multiple passes over the same text with very targeted questions.
Oh, and I definitely notice my laptop screeching to a halt when it's processing a big transcript, but in my case I can delay those jobs to a time when I'm not at my computer.
Nice, that's a cool use case. I like that the local model gives you more privacy when sharing potentially sensitive data like voice recordings with an LLM. I've been interested in hosting one locally, but I was curious what I would be giving up compared to the commercial models. It sounds like it's still possible to get a reasonable result, with some caveats. Thanks.
You will simply need a lot of GPU cores/VRAM. On my $4,000 Mac Studio M2 Ultra with 64GB of RAM, I can comfortably run deepseek-r1:32b, but a) load times can be annoying (i.e., if you are switching models for different tasks or letting them idle out) and b) you can certainly tell that it requires tuning of the context length, temperature, etc. based on what you need to do.
Compare that with the commercial models where a lot of that is done on a large scale for you.
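If it's Ollama doing the serving, one thing that helps with the load-time annoyance is pinning the model in memory with keep_alive so it doesn't idle out between tasks; the per-task tuning then just rides along as request options. A rough sketch (model and values are examples):

```python
# Sketch: keep a model resident and pass per-task tuning as request options.
import requests

requests.post("http://localhost:11434/api/generate", json={
    "model": "deepseek-r1:32b",
    "prompt": "warm-up request",
    "stream": False,
    "keep_alive": -1,        # -1 = keep loaded indefinitely (default unloads after a few minutes)
    "options": {
        "num_ctx": 8192,     # context length for this kind of task
        "temperature": 0.6,  # sampling temperature for this kind of task
    },
})
```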
Yeah that makes sense. Once the model is loaded though, does it work well in comparison to the commercial models? Do you find that the local models hallucinate more, or don't give the same response quality?
What we need is a platform for benchmarking hardware for AI models: with X hardware you get Y tokens per second at Z latency for context/prompt prefill. So, a standard testing methodology per model, with user-supplied benchmarks. Yes, I recognize there's going to be some variability based on different versions of the software stack and encoders.
The end-user experience should start with selecting the models you're interested in running, and it should output hardware builds with price tracking for components.
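A lot of the raw numbers are already in the API responses; for instance, Ollama reports prompt (prefill) token counts and durations alongside generation timings. A rough sketch of what a standardized per-model measurement could look like (the field names are from Ollama's REST API; the model and prompt are placeholders):

```python
# Sketch: measure prefill latency and generation throughput for one model.
import requests

def bench(model, prompt, max_tokens=256):
    r = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens},
    }).json()
    # Durations are reported in nanoseconds; repeated runs may hit the prompt cache.
    prefill_s = r.get("prompt_eval_duration", 0) / 1e9
    gen_tok_s = r["eval_count"] / (r["eval_duration"] / 1e9)
    print(f"{model}: {r.get('prompt_eval_count', 0)} prompt tokens, "
          f"prefill {prefill_s:.2f}s, generation {gen_tok_s:.1f} tok/s")

bench("llama3.1:8b", "Lorem ipsum " * 500)  # long prompt to exercise prefill
```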
Agreed. I opened the comments to write that nearly all of these articles spend very few words on the hardware, its cost, and the performance compared to using a web service. The result is that I'm left with the feeling that I'd have to spend about $1,000 plus setup time (HW, SW) and power to get something that could be slower and less accurate than the current free plan of ChatGPT.
I'm sorry to use HN as Google or even ChatGPT, but are these systems just for LLMs?
I'm wondering about multi-modal models, or generative models (like image diffusion models). For example, I was wondering about noise removal from audio files: how hard would it be to find open models that could be fine-tuned for that purpose, and how easy would they be to run locally?