Any such examples are often "closely guarded secrets" to prevent them from being benchmaxxed and gamed - which is absolutely what would happen if you consolidated them in a publicly available centralized repository.
Since such a database would evolve continuously, I wouldn't see that as a problem. The important thing is that each example is verifiable via an unmodifiable test setup: the LLM provides a solution, which is then executed against the tests to verify it. Something like the Acid3 test... But sure, probably any setup can be gamed somehow.
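A minimal sketch of what I mean, assuming a Python harness; the `slugify` task, file names, and test cases are made-up placeholders, not part of any real benchmark:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical immutable test suite. It checks a model-written solution.py
# exposing a function `slugify(text)`. The tests never change; only the
# candidate solution does.
TEST_SOURCE = '''
import unittest
from solution import slugify

class TestSlugify(unittest.TestCase):
    def test_basic(self):
        self.assertEqual(slugify("Hello, World!"), "hello-world")

    def test_collapses_whitespace(self):
        self.assertEqual(slugify("  a   b  "), "a-b")

if __name__ == "__main__":
    unittest.main()
'''

def verify(candidate_code: str) -> bool:
    """Write the model's code next to the fixed tests and run them in a
    subprocess; the model never sees or edits the test file."""
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        (work / "solution.py").write_text(candidate_code)
        (work / "test_solution.py").write_text(TEST_SOURCE)
        result = subprocess.run(
            [sys.executable, "test_solution.py"],
            cwd=work,
            capture_output=True,
            text=True,
            timeout=30,
        )
        return result.returncode == 0

if __name__ == "__main__":
    # Stand-in for an LLM-generated answer.
    candidate = (
        "import re\n"
        "def slugify(text):\n"
        "    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')\n"
    )
    print("PASS" if verify(candidate) else "FAIL")
```

The point is only that verification is mechanical and the tests live outside the model's control; a real setup would also need sandboxing and resource limits.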
This seems like a non-issue, unless I'm misunderstanding. If failure cases can be used to game benchmarks, the companies are already doing so; they don't need us to avoid compiling that information, which would be genuinely helpful to actual users.
People might want to run the same test scenario again in the future to see how much the models have improved. That's not possible once the example has been scraped into the training data.