Sonnet 3.7 and O1 Pro both have 200K context windows, but O1 Pro's maximum output is 100K tokens while Sonnet 3.7's is 128K. Point for Sonnet.
I routinely put 100K+ tokens of context into Sonnet 3.7 in the form of source code, and in Extended mode, given the right prompt, it will output perhaps 20 large source files before needing a "continue" request (for example, when asked to convert a web app from templates to React).
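That "continue" workflow can be automated. A minimal sketch, assuming the model signals a truncated answer via a `max_tokens`-style stop reason (as the Anthropic Messages API does); the client call is injected as a plain function so the loop itself runs without any API access:

```python
def generate_until_done(create_fn, prompt, max_rounds=10):
    """Call create_fn repeatedly, asking the model to continue whenever it
    stops because it hit its output limit, and stitch the pieces together.

    create_fn takes a message list and returns a dict with "text" and
    "stop_reason" keys -- a simplified stand-in for a real API response.
    """
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = create_fn(messages)
        parts.append(resp["text"])
        if resp["stop_reason"] != "max_tokens":
            break  # model finished on its own
        # Feed the partial answer back and ask for the rest.
        messages.append({"role": "assistant", "content": resp["text"]})
        messages.append({"role": "user",
                         "content": "Continue exactly where you left off."})
    return "".join(parts)
```

With the real SDK, `create_fn` would wrap `anthropic.Anthropic().messages.create(...)` and read `response.stop_reason` and `response.content[0].text`; the wrapper shape here is an assumption for illustration.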
I'm curious whether O1 Pro actually exceeds Sonnet 3.7 in Extended mode for coding or not. Looking forward to seeing some benchmarks.
I am very curious how 3.7 and o1 pro perform in this regard:
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
Has anyone ever tried to restructure a ~10K-token text? For example, reorganizing a 45min-1hr interview transcript without losing any of the specific numbers, facts, or supporting evidence. I find that none of OpenAI's models can handle this task: they keep summarizing and omitting details. Such a task doesn't seem to require much intelligence, yet surprisingly OpenAI's "large" context models can't manage it.
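One workaround for this kind of detail loss is to restructure chunk by chunk rather than in one shot, so the model never has enough slack to summarize. A hedged sketch: the chunker below is real, runnable code, while the per-chunk LLM call it would feed is left out as a hypothetical step:

```python
def chunk_transcript(turns, max_chars=4000):
    """Group speaker turns into chunks of at most max_chars characters,
    never splitting a single turn across chunks (a turn longer than
    max_chars becomes its own chunk).

    Each chunk would then be restructured by the model independently,
    and the outputs concatenated, so no detail falls out of scope.
    """
    chunks, current, size = [], [], 0
    for turn in turns:
        if current and size + len(turn) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that the chunk boundary follows speaker turns, not raw character offsets, so no sentence is cut mid-thought; a small overlap between chunks could be added if cross-turn references matter.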