Hacker News | vadansky's comments

Can I run something comparable to Opus 4.6 locally yet? I keep hearing conflicting things. If I can spend 10k to do that I would cancel my subscription. The problem is I don’t wanna spend the money to find out myself.

If you want frontier-level, the economically reasonable option is OpenRouter or a direct sub to frontier-of-your-choice.

The reality is that vendors do not offer configurations that would let a consumer run that much VRAM in a single setup, in order to protect datacenter margins. Apple used to, then stopped; those devices are going for ~$20k+ each on eBay now.

You can get very, very capable models on a 3090/4090/5090/6000 series card. But if you want 'frontier level' you are investing ~22k at a bare minimum if you go new. Used, you can probably build your own server for a much cheaper up-front cost, but it's likely going to use 4-6x+ the electricity.


I truly think by 2028 we'll have integrated chip systems that'll be able to run Opus 4.8 level models at ~500 watts with acceptable performance. Honestly I think now is the worst time to invest in AI hardware. Get your harness ready and processes perfected with hosted models, and wait a few years to buy hardware to transition to running models locally.

Burning weights onto a chip in an efficient way and exposing that via USB would be acceptable for a good enough model tbh

This is pretty close to what Taalas is doing.

Trying Taalas is almost scary, there is something unsettling about that speed! Even with that small model, because of the speed, you could run hundreds of sample runs in a second, and pick the best.

Can't wait for their next release!


Right now, we seem to be shambling toward a war which would hit globalized industrial processes very hard. Buying decent hardware now might wind up looking like good insurance against that.

> Honestly I think now is the worst time to invest in AI hardware.

That position is not without its own risks, though. Maybe Opus 4.8 will run on a single chip by 2028... and maybe you won't be allowed to touch it.

And what if Xi makes a play for Taiwan? That would be stupid, but so was invading Ukraine with tanks from Temu, and it still happened.


Other than Taiwan declaring independence, I don't see any reason why China would rush to take the island.

At the very least they would wait until they cracked EUV and mass-produce the chips, and that is still 4-5 years away at the earliest.


> so was invading Ukraine

The difference is that Putin's hand was forced by age, (possibly) illness, and the last several decades of how he chose to run his country. Putin's power base is a relatively small group of elites and oligarchs who, given the chance, would happily snuff out the man who pushes them out of windows when they get too uppity. He needed the cover of war to maintain the fiction of his type of strongman "only I can save us" leadership.

Xi's power base is the simple fact that his leadership has transformed China into the #2, and now, because of Trump, possibly soon the #1 world superpower. He has also acted aggressively in the last decade to find and remove corruption and prevent individuals from accumulating the kind of wealth and influence that could threaten his power from outside official Party channels. Of course, as I'm not Chinese myself, I have no clue what the internals of Party politics actually look like. But as an outside observer it seems clear that Xi et al. do not actually need Taiwan for anything other than national pride. They know the US would go to the mat to protect it as TSMC is extremely vital to US military power. And since China cannot compete in that arena and has too much to lose, they instead have focused on weakening the US from within, quite successfully of late.

By the time China finally takes Taiwan it will be with little fanfare and little consequence - they won't touch it until the US either has lost its military capabilities, or the US has its own internal chip industry. Anything else is an existential risk for the coastal cities that are China's entire economic advantage.


if such hardware becomes available, it will be bought by the data-centers, just like they buy all the RAM today

There are also significant economies of scale (namely: utilization and batching), which tend to make inference on a shared server more economical even after the operator takes a cut.

You can use batching on consumer hardware, it just requires a KV-cache efficient model (or short context only) and keeping multiple inference flows running in parallel. This is most useful in combination with streamed inference, since the compute intensity of decode with those newer KV-compressed models is high enough that you have limited compute headroom when running at the speed of RAM.

10k will not get you anywhere near opus or sonnet. It's simply not possible for mere mortals currently.

> Can I run something comparable to Opus 4.6 locally yet?

Sadly, no. The best comparable thing you can get is about Sonnet 3.7


Some benchmarks have shown Kimi K2.6 within error-bar distance of Opus 4.6, and you can run it on eight RTX6000s. Right now it's not possible to set up a machine like that from scratch for less than $100K... but right now it's also hard to put a price on autonomy.

You need a lot less than that if you're willing to stream the model from SSD. At that point, the best machine is probably a cheap old-gen HEDT with lots of PCIe lanes to attach cheap NVMe storage to, so as to stream the model at reasonable speed. That's expensive but not $100k expensive!
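A rough way to sanity-check the SSD streaming idea: if every active weight has to be read from storage for each token, decode speed is capped at read bandwidth divided by bytes read per token. The sketch below is only that back-of-envelope bound. The drive count, per-drive bandwidth, and active parameter count are placeholder numbers picked for illustration, not specs of any model or machine mentioned in this thread.

```python
def streaming_tokens_per_second(read_gb_per_s: float,
                                active_params: float,
                                bits_per_param: float) -> float:
    """Upper bound on decode speed when all active weights are read from SSD per token.

    Ignores caching of hot weights in RAM, compute time, and KV-cache traffic.
    """
    bytes_per_token = active_params * bits_per_param / 8
    return read_gb_per_s * 1e9 / bytes_per_token


# Placeholder assumptions (hypothetical, for illustration only):
DRIVES = 4                 # NVMe drives read in parallel
GB_PER_S_PER_DRIVE = 7.0   # sequential read per drive
ACTIVE_PARAMS = 32e9       # parameters touched per token in a MoE model
BITS = 4                   # quantization width

bound = streaming_tokens_per_second(DRIVES * GB_PER_S_PER_DRIVE, ACTIVE_PARAMS, BITS)
print(f"at most {bound:.2f} tokens/s")
```

Anything kept resident in RAM (shared layers, frequently used experts) raises the real number; compute and random-read overhead lower it.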

I spent 8k and get something close to Sonnet, just 2-3x slower, running DeepSeek V4 Flash on 2x Spark.

Best you could do is connect two Mac Studio M3 Ultra 512G RAM each with Thunderbolt. Then theoretically you can run frontier Chinese models (but not Deepseek v4 Pro yet). That would be about $20k.

But - good luck finding them. Apple discontinued the model a few months ago. And more recently, even the 256G model was discontinued. Big AI really really does not want people to get off their needle.


DeepSeek V4 Pro is ~800GB total at native quantization (1.6T params with most being 4-bit) so it can run on the hardware you mentioned. There is also a 2-bit version that will run on a single 512GB machine. SSD streaming also makes lower-end hardware viable to at least test the model, if not quite run it usefully.
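The arithmetic behind those sizes, as a quick sketch. It takes the 1.6T parameter count from the comment above and simplifies by treating every weight as the same width (the comment says "most" are 4-bit, so the real figure will differ a bit). It also ignores KV cache and runtime overhead, which need headroom on top.

```python
def weight_footprint_gb(params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params * bits_per_param / 8 / 1e9


PARAMS = 1.6e12        # parameter count quoted in the comment above
MACHINE_RAM_GB = 512   # one 512GB machine

four_bit_gb = weight_footprint_gb(PARAMS, 4)  # simplification: all weights 4-bit
two_bit_gb = weight_footprint_gb(PARAMS, 2)

print(f"4-bit: {four_bit_gb:.0f} GB, fits on two machines: {four_bit_gb <= 2 * MACHINE_RAM_GB}")
print(f"2-bit: {two_bit_gb:.0f} GB, fits on one machine: {two_bit_gb <= MACHINE_RAM_GB}")
```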

Same, I like to call it rubber duck coding (now the duck talks back!)

Edit: Now I want an LLM connected rubber duck with a speaker/microphone that sees your screen


Reminds me of RubberDuckGPT (rubber-duck-gpt.com):

“I won't give you answers. Instead, I'll reflect your questions back to help you think more deeply about your problems.”


Totally doable and I would buy one. Only problem is that most of the time when I'm doing "SWE" stuff I'm around other people and can't have the conversation out loud.

It's from the model card:

> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).

https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...

(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)


Yeah they detect the activity using a secure, deterministic heuristic system called “Generalized Reconnaissance Enabling Exfiltration of Deleterious Investigations.” And it’s all implemented using their new internal protocol called “Base Unified Limitation Layer for Security Hacking Investigation Tactics”

Collectively, they are known as GREEDI-BULLSHIT.


That is for whatever it considers reverse-engineering the model to try to create a competing one.

No, that’s for “frontier LLM development” which somehow includes examples like distributed training infra.

Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.


It does nothing to protect against distillation attacks, because distillation attacks are far less interested in the topic of AI research than in just generally getting tons of diverse output from the model. It might be that Mythos was (accidentally?) trained on internal Anthropic documentation on how Mythos was trained, and thus it could leak secret sauce? Doubtful; it feels like it's less about the specific attack of reverse-engineering Mythos, and more about being a general sophon against any model training at all; that Anthropic's official position is now that they're the only ones who should be training models.

No, it's not about reverse engineering. It targets ML research.

But enough about LLMs

Personally because I'm making a blender add on that only uses python, and it's at the complexity where having types catches a ton of bugs easily.

I would love something that you can open and it expands/pops out a split keyboard like the Voyager (https://www.zsa.io/voyager)


I don't know, maybe I'm doing it wrong but I feel LLMs add a slop debt, and each agent pass just exacerbates it.

Like I had an LLM implement a spec and it said it was done... Except it had a ton of `casts` everywhere. Okay, my bad, I should have been clear "NO CASTS", so I used the LLM to remove the casts, except it just kept making things more and more complicated and ugly.

It took me taking a break and having a shower thought to realize all the ugliness is because one type should have been broken up into 2, which would remove a ton of generics and code. But Claude never suggested that, it was always "we need at least one cast here, or we need 1000 LOC of generic factories". I tried multiple new sessions with various prompts too.

Maybe one day soon LLMs could pay off their own slop debt but at least right now I don't trust them to write code unseen.

Edit: Maybe the correct action would have been to delete everything and make it rewrite everything from scratch with a clear "NO CASTS EVER" rule. But still, the point is that having an LLM clean up after an LLM doesn't feel like it works well enough to just keep it in a loop and never look at what it does.


This matches my experience.

I've had to put a fair chunk of effort into skills that will run deterministic mechanisms to unslop a codebase (cyclomatic complexity grading has been really helpful here) as invariably some amount of guidance around principles will be missed over time. I've found it does help, though. Certainly I'm getting overall better results from Flash and Sonnet over multiple runs for fairly modest token increases. GPT 5.5 less so, but that's because it scores better in a first pass. I won't really know until I gauge it at the end of my sub month which has been more cost efficient for me all things considered.
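For anyone wondering what a deterministic complexity grader can look like, here is a minimal sketch, assuming a Python codebase. It is not the commenter's actual skill: it is an approximate McCabe count (1 plus the number of decision points) built on the standard `ast` module, and the threshold of 10 is an arbitrary illustrative default. Nested functions are counted toward their enclosing function as well as on their own.

```python
import ast

# Node types treated as one decision point each (an approximation of McCabe).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Approximate complexity: 1 + decision points found under `func`."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            score += len(node.values) - 1
    return score


def grade_source(source: str, threshold: int = 10) -> list[tuple[str, int]]:
    """Return (function name, score) for every function scoring above `threshold`."""
    offenders = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = cyclomatic_complexity(node)
            if score > threshold:
                offenders.append((node.name, score))
    return offenders
```

The list of offenders can be fed back to the agent as a concrete "simplify these functions" task instead of a vague style reminder.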


I’m in a similar boat. I find that longer sessions will introduce “noise”. I have to be extremely explicit to avoid adding this noise, as it pollutes the future output of the models. Sometimes it’s innocuous, other times it can derail sessions as the 2nd or 3rd pass introduce even more of their own noise.

To me, it seems the models are inherently designed to do this. Creating more verbose output than input, generating plans that introduce things I didn’t ask for, extras, more “defensive” code that makes sense at first but is completely unnecessary in practice… I find it exhausting, but it’s important to pare down the output / plans at each stage and trim the generated stuff that isn’t needed.


The problem is that we have an ever growing and large number of constraints, and not following even a single one means the result is sloppy.

I don’t see them fixing this any time soon, and thus human in the loop is a requirement to use these tools effectively. That is unless you love your slot machine dopamine rush enough to ignore quality gates and respect for your peers’ time.


I've been reading and writing Rust for a long while now, since before 1.0. I'm capable of critically evaluating Rust code. I'm also a happy Claude Code user, mostly for lightweight uses like generating scaffolding, prototyping, and debugging.

The pure LLM, no human intervention vibe-coded PRs on Bun since the vibe-rewrite to Rust contain the worst coding horrors I've seen in 20 years of programming.

Setting aside the quality of the change itself (I would have done it differently, for sure: it is pretty straightforward to build a safe abstraction out of this type), the utterly pointless "source-text consistency test" added here is easily the worst example of "test repeats implementation" I have seen in my career:

https://github.com/oven-sh/bun/pull/30728/files#diff-863477b...


Write a skill outlining your expectations of the code, put that skill into the pipeline, so that it can be included within your workflow.

Webdev here, but currently I have:

- a skill where I outlined what the architecture of the system should look like, with guards (static analysis, architecture tests, linting) confirming that the code it generates adheres to standards

- a skill that tells it what tests should look like (use generators, write both feature / unit tests)

- a skill that tells it to generate docs from the code in the form of acceptance criteria (Given / When / Then)

- a skill that tells it to generate frontend UAT tests + accompanying backend seeders given the AC

- a skill that tells it to verify that ticket objectives match what was delivered

At this point I still need to guide it to move a task from one stage to the other (coding, testing, verification that indeed what was coded adheres to what was required), but I believe that these dynamic workflows can automate this work as well.


If you want hard rules, use deterministic tools. Prompts are for fuzzy guidance.


How would you prevent a junior engineer from making this mistake? Presumably, you would set up a lint rule. Do the same for LLMs. Run the linter after each edit through a hook and feed the output back to the LLM. Write your lint rules with clear explanations of why the behavior is a problem, and nudges toward the good behavior.
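A minimal sketch of such a rule for the "NO CASTS" case from upthread. The original comment doesn't say which language was involved, so this assumes Python and `typing.cast`. `find_casts` and `feedback_for_agent` are hypothetical names, and how the text gets wired into an editor or agent hook is left out because that depends on the tool.

```python
import ast

# Shipped with every finding, so the model sees why, not just what.
CAST_ADVICE = (
    "cast() silences the type checker instead of fixing the type. "
    "Try narrowing with isinstance(), or split the type so no cast is needed."
)


def find_casts(source: str, filename: str = "<string>") -> list[str]:
    """Flag every call named `cast`, e.g. cast(...) or typing.cast(...).

    Crude name-based match: it will also flag unrelated methods called `cast`.
    """
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if isinstance(func, ast.Name):
            name = func.id
        elif isinstance(func, ast.Attribute):
            name = func.attr
        else:
            name = None
        if name == "cast":
            hits.append((node.lineno, f"{filename}:{node.lineno}: {CAST_ADVICE}"))
    return [message for _, message in sorted(hits)]


def feedback_for_agent(findings: list[str]) -> str:
    """Text to hand back to the model after an edit (hypothetical hook output)."""
    if not findings:
        return "lint: ok"
    return "lint: fix these before continuing\n" + "\n".join(findings)
```

Because the check is deterministic, the rule holds on every edit without taking up prompt space, and the explanation in each finding does the nudging.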


You wouldn't prevent the junior from making this mistake.

You would correct them once or twice, and they won't make the mistake again.

It's something we can't do with LLMs currently, so we all just try (and fail) to predict every possible failure, and then somehow try to cram it all into the limited context.


You would review their code, and give them the feedback. They would learn from that, and not make the mistake again (or not make it after receiving the same feedback again).



what's the alternative?


archive.org


It was just demonstrated upthread that archive.org doesn't work for this purpose.


This is annoying since I have a side project I like to use alchemical names in, and HERMES.md sounds like something I would do. Guess I have to go with AGRIPPA.md, but Hermes Trismegistus is so much cooler...


I've been using Notepad Next, it supports leaving all your tabs open when you close the window which is the main feature I need. But I do miss the plugins.

