What they don't mention is all the tooling, MCPs, and other scaffolding they've added to make this work. It's not 30 hours out of the box. It's probably heavily guard-railed, with validated plans, checklists, and verification points along the way. It's closer to 'lab conditions'; you won't get that output in real-world situations.
Yeah, I thought about that after I looked at the SWE-bench results. It doesn't add up that the SWE-bench scores are barely an improvement, yet the model somehow shows a much bigger jump on long tasks. You'd expect a huge gain in one to translate to the other.
Unless the main area of improvement was the tools and scaffolding rather than the model itself.