
numeri · 2026-06-15T02:38:56 1781491136

One context I could imagine is a young person with a shaky grasp of English trying to come up with an interesting school/university project via conversations with an LLM set up as an OpenClaw agent.

It's got the right combination of inexperience, cluelessness, panic, expectations that Westerners are rich, and hopes of others being willing to fix their mistake.

numeri · 2026-06-12T22:47:14 1781304434

especially because this is the most painfully glaring flaw in their plan. Their solution is for an inference provider to... store the KV cache (which they can compute!) on-premise, on their own disks, but pay some third party for it?

mistercow · 2026-06-12T23:28:26 1781306906

Well, it’s one flaw. I would argue that the bigger flaw, which you alluded to, is that the cost of computing the cache yourself maxes out in the single-digit dollars even for very large frontier models, and that’s a one-time cost. Even if you imagine all the logistics are free and all the transfers are instant, what are we even talking about here from an economic perspective?

KV caching is a super interesting engineering space, especially when you’re talking about local models where compute and memory bandwidth are highly constrained and you’re trying to trim fractions of a second everywhere you can by flipping between different ICL prefixes. But selling caches for specific documents just makes no sense at all.

numeri · 2026-06-12T17:16:11 1781284571

I've had it happen. I ran an experiment, taking a couple hours and producing ~2 GiB of files. One of the results looked good, so I told Claude Opus 4.5 (at the time) to commit the code changes, upload the important file to cloud storage, then clean up the rest.

I then saw it run `rm -r results/`, before messaging me: "Now all that's left is for you to upload the successful results, then I'll delete the rest!"

Why did it not upload the files itself, when it had been using the cloud storage CLI during that session? No clue. I do accept that I could have and should have just uploaded the file myself. It would have taken 3 seconds to type.

numeri · 2026-06-11T20:36:59 1781210219

To be fair, it is good to know that it disobeys simple instructions like "don't examine my git history" far more than other models. (It should of course be a different benchmark, so as not to conflate things.)

It's not a great sign for alignment.

bensyverson · 2026-06-11T21:02:47 1781211767

Agreed, alignment is just a separate issue that a vuln-fixing benchmark doesn't need to be testing.

numeri · 2026-06-05T05:28:09 1780637289

I would just warn that you may not be able to recognize what is worth learning at your stage.

Intuition for library design and the architecture of software packages/external APIs is something you can only learn by doing.

numeri · 2026-04-15T20:51:40 1776286300

I have DSPD as well, and was pleasantly surprised to see how much of the article discussed DSPD.

That being said, I do think a lot of what the author is saying flies right in the face of traditional advice, esp. the suggestion that we should all just free-sleep and rotate around the clock. I personally find myself happiest when I'm entrained to the 24-hour cycle, but at my own natural offset. Whenever I've been cycling the day it's felt miserable, uncontrollable and exhausting.

To be fair, the author did claim that you can fully solve this by completely cutting out after-dark electronics, but I've tried pretty intensely to do exactly that for extended periods in the past, and didn't see any progress. I do sleep amazingly when camping, though, and the delay is smaller than normal (still definitely there).

numeri · 2026-04-06T23:36:53 1775518613

11/20 for qwen/qwen3.5-flash-02-23 in Claude Code, with effort set to low.

numeri · 2026-03-09T13:53:22 1773064402

No, that's what the headline implies, and the body of the article doesn't support it at all. It's (currently, and with no indication of intent to change this) two separate branches of their business.

numeri · 2026-02-25T15:24:01 1772033041

but Taalas had to quantize Llama 3.1 8B to death to get it to fit. It can't produce coherent non-English text at all.

numeri · 2026-02-16T17:08:20 1771261700

and if I were to guess, the latest generation of models (Claude Opus 4.6, GPT-5.3-codex, etc.) differ from Opus 4.5 and GPT-5.2 primarily in the addition of deeper, more difficult (most likely agentic and coding-based, like Terminal Bench) tasks to their RLVR training.

I could be completely off, as my intuition here is fully based on public research papers, but it seems to explain the current state of things fairly well.