What they don't mention is all the tooling, MCPs, and other scaffolding they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with validated plans, checklists, and verification points along the way. It's closer to 'lab conditions'; you won't get that output in real-world situations.
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't add up that the SWE-bench scores are barely an improvement, yet the model somehow shows a much bigger jump on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was the tools and scaffolding rather than the model itself.