Unless you're looking at something like a pass@100 benchmark, the benchmarks are...

make3 · 2026-04-22T04:41:10 1776832870

The pass@100 is such a weird critique angle that is surprisingly mainstream; guess what, no one cares if the correct answer is in the top 100, it needs to be the top 1. A model with a better answer in the top 1 is a better model, full stop.

mentalgear · 2026-04-21T21:06:24 1776805584

This. Plus if you want to even attempt measuring real 'intelligence' you want to run a neuro-symbolic, de-lexicalized benchmark (e.g. DL-ReasonSuite, SoLT, GSM-Symbolic) - which none of the providers releasing new models showcase.