I think that is an interesting observation and I generally agree.
Your point about prompting quality is very valid, and for larger features I always use PRDs that are 5-20x the length of the prompt.
The thing is, my "experiment" is one that represents a fairly common use case: this feature is actually pretty small and embeds into a pre-existing UI structure - in a larger codebase.
GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results. Not only does it work on the first try, Codex is also reliably better at understanding the context and doing the things that are common and best practice in professional SWE projects.
If I want to get something comparable out of Claude, I would have to spend at least 20 minutes preparing the prompt, if not more.
> The thing is my "experiment" is one that represents a fairly common use case
Valid as well. I guess I'm just nitpicking: seeing how often people say these models aren't useful, combined with this example, triggered my "you're doing it wrong" mode :D
> GPT-5-Codex allows me to write a pretty quick & dirty prompt, yet still get VERY good results.
I have a reputation with family and co-workers of being quite verbose - this might be why I prefer Claude (though I haven't tried Codex in the last month or so). I'm typically setting up context, spending a few minutes writing an initial prompt, and iterating/adjusting on the approach in planning mode so that I _can_ just walk away (or tab out) and let it do its thing, knowing that I've already reviewed its approach and have a reasonable amount of confidence that it's taking an approach that seems logical.
I should start playing with codex again on some new projects I have in mind where I have an initial planning document with my notes on what I want it to do but nothing super specific - just to see what it can "one shot".
Yeah, as someone who has been using Claude Code for about 4 months now, I’ve adopted a “be super specific by default”-workflow. It works very well.
I typically use zen-mcp-server’s planning mode to scope out these tasks, refine and iterate on a plan, clear context, and then trigger the implementation.
There’s no way I would have considered “implement fuzzy search” a small feature request. I’m also paranoid about introducing technical debt / crappy code, which in my experience is the #1 reason that LLMs typically work well for new projects but start to degrade after a while: there’s just a lot of spaghetti and debt built up over time.
I tend to tell Claude to research what is already there, and think hard, and that gives me much better per-prompt results.
But you are right that Codex does all that by default. I just get frustrated when I ask it something simple and it spends half an hour researching code first.
Some do this by using tools like RepoPrompt to read entire files into GPT-5 Pro, and then having GPT-5 Pro send the relevant context and work plan to Codex so that it can skip poking around files. If you give it the context, it won't spend that time looking for it. But then you spend that time with Pro instead (which can at least ingest entire files at once instead of searching through them, and tends to produce a better plan for Codex).
It worked on the first try, but did it work on the second?
I've noticed that in conversations with LLMs, much of what they come up with is non-deterministic: regenerate the message and it disappears.
That appears to be the basic operating principle of the current paradigm. And agentic programming repeats this dice roll dozens or hundreds of times.
I don't know enough about statistics to say if that makes it better (converging on the averages?) or worse (context pollution, hallucinating, focusing on noise?), but it seems worth considering.
I would think that to truly rank such things, you should run a few tests and look for a clear pattern. It's possible that something prompted Claude to take "the easy way" while ChatGPT didn't.
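To make that concrete, here's a minimal, hypothetical sketch of what "run a few tests" could look like: re-run the same request several times per model and tally how often the result passes your checks. run_agent and passes_checks are placeholders for however you actually drive the agent and verify its output, not real APIs.

    import random

    # Hypothetical harness: re-run the same prompt N times per model so a
    # pattern emerges instead of judging on a single dice roll.

    def run_agent(model: str, prompt: str) -> str:
        # Placeholder: invoke the coding agent however you normally do
        # (CLI, API, ...) and return the output/diff it produced.
        return f"{model} output for: {prompt}"

    def passes_checks(output: str) -> bool:
        # Placeholder: run your test suite / review criteria on the output.
        return random.random() < 0.8

    N = 10
    results = {}
    for model in ("claude", "codex"):
        passes = sum(passes_checks(run_agent(model, "implement fuzzy search")) for _ in range(N))
        results[model] = passes / N

    print(results)  # e.g. {'claude': 0.6, 'codex': 0.9}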