Sonnet 3.7 and O1 Pro both have 200K context windows, but O1 Pro's maximum output is 100K tokens while Sonnet 3.7's is 128K. Point for Sonnet.
I routinely put 100K+ tokens of context into Sonnet 3.7 in the form of source code, and in Extended mode, given the right prompt, it will output perhaps 20 large source files before needing a "continue" request (for example, when asked to convert a web app from templates to React).
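That "continue" workflow can be automated. A minimal sketch, assuming the model signals a truncated answer via a `max_tokens`-style stop reason (as the Anthropic Messages API does); the client call is injected as a plain function so the loop itself runs without any API access:

```python
def generate_until_done(create_fn, prompt, max_rounds=10):
    """Call create_fn repeatedly, asking the model to continue whenever it
    stops because it hit its output limit, and stitch the pieces together.

    create_fn takes a message list and returns a dict with "text" and
    "stop_reason" keys -- a simplified stand-in for a real API response.
    """
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = create_fn(messages)
        parts.append(resp["text"])
        if resp["stop_reason"] != "max_tokens":
            break  # model finished on its own
        # Feed the partial answer back and ask for the rest.
        messages.append({"role": "assistant", "content": resp["text"]})
        messages.append({"role": "user",
                         "content": "Continue exactly where you left off."})
    return "".join(parts)
```

With the real SDK, `create_fn` would wrap `anthropic.Anthropic().messages.create(...)` and read `response.stop_reason` and `response.content[0].text`; the wrapper shape here is an assumption for illustration.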
I'm curious whether O1 Pro actually exceeds Sonnet 3.7 in Extended mode for coding or not. Looking forward to seeing some benchmarks.
I am very curious how 3.7 and o1 pro perform in this regard:
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
Has anyone ever tried to restructure a ~10K-token text? For example, reorganizing a 45min-1hr interview transcript without losing any of the specific numbers, facts, or supporting evidence. I find that none of OpenAI's models can handle this task: they keep summarizing and omitting details. Such a task doesn't seem to require much intelligence, yet surprisingly OpenAI's "large" context models can't manage it.
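One workaround for this kind of detail loss is to restructure chunk by chunk rather than in one shot, so the model never has enough slack to summarize. A hedged sketch: the chunker below is real, runnable code, while the per-chunk LLM call it would feed is left out as a hypothetical step:

```python
def chunk_transcript(turns, max_chars=4000):
    """Group speaker turns into chunks of at most max_chars characters,
    never splitting a single turn across chunks (a turn longer than
    max_chars becomes its own chunk).

    Each chunk would then be restructured by the model independently,
    and the outputs concatenated, so no detail falls out of scope.
    """
    chunks, current, size = [], [], 0
    for turn in turns:
        if current and size + len(turn) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(turn)
        size += len(turn)
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that the chunk boundary follows speaker turns, not raw character offsets, so no sentence is cut mid-thought; a small overlap between chunks could be added if cross-turn references matter.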