
> The SWE-Bench result carries a bit more weight

Although I have issues with it (few benchmarks are perfect), I tend to agree. Gemini's 63.8 versus Sonnet's 62.3 isn't a huge jump, though. To Gemini's credit, it solved a bug in my PyTorch code yesterday that o1 (through the web app) couldn't (or at least didn't with my prompts).


