> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 is only a 1.5-point gap, though, which isn't a huge jump. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't, or at least didn't with my prompts.