In my experience, you reach the point where what you work on is the main question and pay is barely a minor concern much earlier than that. You don't need to be in the top AI research tier for that.
Back in my day Joe Blow didn't need to try anything as risky as a Twitter prompt; simply clicking an image link posted in a message on some random forum would scorch his pure soul with a goatse. You don't want to google it, but I'm pretty sure you can discuss it safely with ChatGPT.
You can type in whatever sequence of keys you want, but Amazon pretty much created and shaped that Cloud thing, if you've ever heard of it. And that's just one of their side projects.
If you live in most cities in Italy you have to take a huge hit to your ability to get places (in a reasonable timeframe or at all) if you must do it with a car.
It's not strange at all; I was responding to a specific, incorrect claim. I even quoted the wrong claim in my earlier comment, and I'll repeat it again, with added emphasis:
>>> humans are incredibly efficient, from an energy perspective, in anything we do, compared to machines
I simply provided contrary evidence to a well-defined, falsifiable claim. How is that strange?
Yes, but walking and moving on wheels are apples and oranges. It would be a relevant comparison if a robot with a movement mechanism based on two feet were more efficient than a human.
> in one assignment I remember comparing the energy outputs between the human and robot equivalents of different tasks, whether or not the robot was humanoid in how it was designed
So I think the point in this context is relevant, even if it's apples to oranges.
The point isn't that a humanoid robot walking is less efficient than a human walking; it's that moving on wheels is not the same thing as walking. For example, using wheels is not only less efficient, it is barely usable for climbing rocks, going up stairs, and crossing many other surfaces, which makes the comparison irrelevant.
You could say that a robotic gun is much more efficient than a human at killing. That's another easy comparison of different tasks where robots win, but it totally misses the point.
I’ll admit, at first, I thought the human vs machine comparison was about humanoid machines. But that’s too narrowly defined to be a useful comparison. Most machines in use today are not humanoid.
To then boldly claim that humans are more efficient at anything compared to a machine just does not follow.
How is that any different from the pre-LLM days, when Jim was using Stack Overflow to build the largest crypto exchange in the world? Where's Stack Overflow's accountability?
At least for me, the answer is that despite the mistakes and the sheer annoyance the prose causes me, they are unbelievably useful. I accomplished multiple major achievements in the last two years that most probably wouldn't have been possible at all without them, and surely not within that timeframe.
The idea is that by the time you have time and remember, the clothes might be smelly and wrinkled. The issue is with the genius product manager who decided the washing machine should have the most annoying beep possible, repeating every minute whether you like it or not, until turned off. Luckily, some manufacturers do employ better product managers.
You have an answer on your page regarding "Should I pick 26B-A4B or 31B?", but can you please clarify whether, assuming 24GB VRAM, I should pick a full precision smaller model or a 4 bit larger model?
Try 26B first.
31B seems to have very heavy KV cache (maybe bugged in llama.cpp at the moment; 16K takes up 4.9GB).
edit: 31B cache is not bugged, there's a static SWA cost of 3.6GB, so IQ4_XS at 15.2GB seems like a reasonable pair, but even then it's barely enough for 64K on 24GB VRAM. Maybe 8 bit KV quantization is fine now that https://github.com/ggml-org/llama.cpp/pull/21038 got merged, so 100K+ is possible.
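A rough back-of-the-envelope for those numbers, as a sketch rather than a measurement. It assumes the 4.9GB observed at 16K already includes the 3.6GB static SWA cost, that the remainder grows linearly with context, and that 8 bit KV quantization halves the whole cache relative to 16 bit. None of that is confirmed here, and the function names are made up for illustration.

```python
# Figures quoted in the comment above (GB); everything else is an assumption.
WEIGHTS_GB = 15.2       # 31B IQ4_XS size
STATIC_SWA_GB = 3.6     # fixed sliding-window cache cost
CACHE_AT_16K_GB = 4.9   # total KV cache observed at 16K context
VRAM_GB = 24.0


def kv_cache_gb(ctx_tokens: int, kv_scale: float = 1.0) -> float:
    """Estimate KV cache size, assuming the non-static part grows linearly.

    kv_scale=0.5 models 8 bit KV quantization as half the 16 bit size
    (an assumption, applied to the static part as well).
    """
    per_token_gb = (CACHE_AT_16K_GB - STATIC_SWA_GB) / 16_384
    return (STATIC_SWA_GB + per_token_gb * ctx_tokens) * kv_scale


def total_gb(ctx_tokens: int, kv_scale: float = 1.0) -> float:
    """Weights plus estimated KV cache; ignores compute buffers and other overhead."""
    return WEIGHTS_GB + kv_cache_gb(ctx_tokens, kv_scale)


for ctx, scale in [(16_384, 1.0), (65_536, 1.0), (102_400, 0.5)]:
    need = total_gb(ctx, scale)
    print(f"ctx={ctx:>7} kv_scale={scale}: ~{need:.1f} GB of {VRAM_GB:.0f} GB")
```

Under these assumptions 64K lands at roughly the full 24GB with nothing left for compute buffers, which matches "barely enough", while 100K with a halved cache leaves a few GB spare.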
> I should pick a full precision smaller model or 4 bit larger model?
4 bit larger model. You have to use quant either way -- even if by full precision you mean 8 bit, it's gonna be 26GB + overhead + chat context.
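The arithmetic behind that answer, as a hedged sketch: weight memory is roughly parameter count times bits per weight, divided by 8. The parameter counts are taken from the model names; the ~3.9 bits per weight figure is back-derived from the 15.2GB IQ4_XS size quoted above, not an official number, and real GGUF files mix precisions per tensor.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes), ignoring file metadata."""
    return params_billion * bits_per_weight / 8


VRAM_GB = 24.0

candidates = {
    "26B @ 16 bit": weights_gb(26, 16),
    "26B @ 8 bit": weights_gb(26, 8),
    # ~3.9 bits/weight is an assumption derived from 15.2GB / 31B params
    "31B @ ~3.9 bit (IQ4_XS-sized)": weights_gb(31, 3.9),
}

for name, gb in candidates.items():
    verdict = "fits before overhead and context" if gb < VRAM_GB else "too big for 24GB"
    print(f"{name}: ~{gb:.1f} GB -> {verdict}")
```

Only the quantized 31B leaves room for KV cache and runtime overhead on a 24GB card; the smaller model at 8 bit already exceeds the budget on weights alone.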
For the best quality reply, I used the Gemma-4 31B UD-Q8_K_XL quant with Unsloth Studio to summarize the URL with web search. It produced 4.9 tok/s (including web search) on a MacBook Pro M1 Max with 64GB.
Here's an excerpt, in its own words:
Unsloth Dynamic 2.0 Quantization
Dynamic 2.0 is not just a "bit-reduction" but an intelligent, per-layer optimization strategy.
- Selective Layer Quantization: Instead of making every layer 4-bit, Dynamic 2.0 analyzes every single layer and selectively adjusts the quantization type. Some critical layers may be kept at higher precision, while less critical layers are compressed more.
- Model-Specific Tailoring: The quantization scheme is custom-built for each model. For example, the layers selected for quantization in Gemma 3 are completely different from those in Llama 4.
- High-Quality Calibration: They use a hand-curated calibration dataset of >1.5M tokens specifically designed to enhance conversational chat performance, rather than just optimizing for Wikipedia-style text.
- Architecture Agnostic: While previous versions were mostly effective for MoE (Mixture of Experts) models, Dynamic 2.0 works for all architectures (both MoE and non-MoE).
This is one of the more confusing aspects of experimenting with local models as a noob. Given my GPU, which model should I use, which quantization of that model should I pick (unsloth tends to offer over a dozen!) and what context size should I use? Overestimate any of these, and the model just won't load and you have to trial-and-error your way to finding a good combination. The red/yellow/green indicators on huggingface.co are kind of nice, but you only know for sure when you try to load the model and allocate context.