
"roughly" is doing a lot of heavy lifting there



The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

Anything can also be run on a cheap computer.

The difference is in speed. A cheap computer may run a big model up to a few orders of magnitude slower than datacenter hardware, depending on whether the LLM is small enough to fit in GPU memory, small enough to fit in CPU memory, or so big that it must spill onto SSDs.

Depending on the application, the tradeoff between run time and run cost may happen to favor using local hardware, despite a much slower speed.

There are plenty of applications where running an overnight job at negligible cost is preferable to obtaining faster results at a very high price. One example is scanning a mature code base for bugs with a large number of different open-weights LLMs, which can achieve bug coverage similar to that of a single overpriced and unavailable SOTA LLM, e.g. Mythos.
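A rough sketch of the three tiers described above, just to make the distinction concrete. The function name, the tier labels, and the example sizes are all made up for illustration, and it ignores splitting weights across GPU and CPU memory:

```python
def weight_tier(model_bytes: float, gpu_bytes: float, ram_bytes: float) -> str:
    """Return the fastest tier that can hold all of the weights.

    Simplification (assumption): the weights live entirely in one tier,
    with no splitting between GPU memory and CPU memory.
    """
    if model_bytes <= gpu_bytes:
        return "gpu"   # fastest: weights fit in GPU memory
    if model_bytes <= ram_bytes:
        return "cpu"   # slower: weights fit in CPU memory
    return "ssd"       # slowest: weights must be streamed from SSD


GB = 1e9
# Hypothetical machine: 16 GB of GPU memory, 64 GB of CPU memory.
print(weight_tier(8 * GB, 16 * GB, 64 * GB))     # gpu
print(weight_tier(40 * GB, 16 * GB, 64 * GB))    # cpu
print(weight_tier(5000 * GB, 16 * GB, 64 * GB))  # ssd
```

Each step down the list still runs the model, only more slowly.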


> The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

You do realize that a model like Opus is (estimated to be) around 5T parameters, and uses around 5TB of GPU memory?

These kinds of things are just impossible to run locally.


This kind of thing can certainly be run locally, even on a small mini-PC, like a NUC, or even on a laptop, with the weights stored on SSDs.

As I have said, the problem is not that they cannot be run, but that they may run more slowly than is acceptable for a given application. Depending on the model, the speeds reported for inference with weights stored on SSDs vary from one token every few seconds to at most a few tokens per second.

Computers could solve relatively huge problems even in the early days of vacuum tube computers, when main memories were measured in kilobytes. At that time nobody expected the data for a problem to fit in main memory, or even in the next tier of memory on magnetic drums or magnetic disks. The really big problems were solved by making a great number of passes over data stored on magnetic tapes.

An LLM whose inference could not be run on a small mini-PC would have to be one hundred times bigger than the biggest existing SOTA LLMs.

Any LLM that exists today can be run on almost any PC, just extremely slowly in comparison with datacenter hardware.
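For a back-of-the-envelope feel for SSD-backed speeds: if the weights needed for a token have to be read from storage each time, the time per token is bounded below by bytes read divided by read bandwidth. The sketch below assumes 7 GB/s of sequential read bandwidth, 1 byte per parameter, and no caching in RAM. The 30B "active parameters" case is a hypothetical model that only touches part of its weights per token, and is not a figure from this thread.

```python
def seconds_per_token(active_params: float,
                      bytes_per_param: float,
                      read_bytes_per_second: float) -> float:
    """Lower bound on time per token when every active weight is read from storage."""
    if read_bytes_per_second <= 0:
        raise ValueError("read bandwidth must be positive")
    return active_params * bytes_per_param / read_bytes_per_second


ASSUMED_SSD_BANDWIDTH = 7e9  # bytes/s, an assumption for illustration

# All 5T parameters read for every token (the estimate quoted in the thread).
dense = seconds_per_token(5e12, 1.0, ASSUMED_SSD_BANDWIDTH)

# Hypothetical model that only reads 30B parameters per token.
sparse = seconds_per_token(30e9, 1.0, ASSUMED_SSD_BANDWIDTH)

print(f"{dense:.0f} s/token")   # 714 s/token
print(f"{sparse:.1f} s/token")  # 4.3 s/token
```

Under these assumptions the result swings by more than two orders of magnitude depending on how many bytes each token actually touches, so any single speed figure only makes sense for a specific model.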


When people say that you "can't do" something, what they actually mean is that it's completely impractical (if not impossible).

Whether something is "impractical" depends on your expectations. High-latency unattended inference is definitely viable, even though it doesn't align much with what's being run in hyperscale datacenters.

I'd like to meet the person who's been using a 1 token/second system as their primary LLM for at least a few weeks. Anyone?

I think 1 token/second is optimistic here - and even then it's over 11 days per million tokens.
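The arithmetic behind that figure, as a quick check (the function name is made up for illustration):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def days_for_tokens(num_tokens: float, tokens_per_second: float) -> float:
    """Wall-clock days needed to generate num_tokens at a steady rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second / SECONDS_PER_DAY


print(round(days_for_tokens(1_000_000, 1.0), 2))  # 11.57
```

At one token every few seconds, the same million tokens takes proportionally longer.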



