Hacker News

I don't really get this. At this point, my limiting factor is not how quickly Claude can self-trudge through code. It's whether Claude is going to do the task correctly or not.

I need more mechanisms for controlling long-running sessions and dynamically injecting my thoughts, corrections, and nudges, rather than faster ways to burn through my tokens without knowing if the results are going to be correct.




I think the theoretical answer here is this:

"Agents address the problem from independent angles, other agents try to refute what they found, and the run keeps iterating until the answers converge."

So you will be supplying the "ground truth" (test suite, detailed spec, whatever) and empower an agent to use it to guide the other agents. Currently a lot of people do this sequentially in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.

Adversarial models are a longstanding technique in ML so it makes sense they would try to go this way.


I don't know, maybe I'm doing it wrong, but I feel LLMs add a slop debt, and each agent pass just exacerbates it.

Like I had an LLM implement a spec and it said it was done... Except it had a ton of `casts` everywhere. Okay, my bad, I should have been clear: "NO CASTS". So I used the LLM to remove the casts, except it just kept making things more and more complicated and ugly.

It took me taking a break and having a shower thought to realize all the ugliness is because one type should have been broken up into 2, which would remove a ton of generics and code. But Claude never suggested that, it was always "we need at least one cast here, or we need 1000 LOC of generic factories". I tried multiple new sessions with various prompts too.

Maybe one day soon LLMs could pay off their own slop debt but at least right now I don't trust them to write code unseen.

Edit: Maybe the correct action would have been to delete everything and make it rewrite everything from scratch with a clear "NO CASTS EVER" rule. But still, the point is that having an LLM clean up after an LLM doesn't seem to work well enough to just keep it in a loop and never look at what it does.


This matches my experience.

I've had to put a fair chunk of effort into skills that will run deterministic mechanisms to unslop a codebase (cyclomatic complexity grading has been really helpful here), as invariably some amount of guidance around principles will be missed over time. I've found it does help, though. Certainly I'm getting overall better results from Flash and Sonnet over multiple runs for fairly modest token increases. GPT 5.5 less so, but that's because it scores better in a first pass. I won't really know until I gauge it at the end of my sub month which has been more cost efficient for me, all things considered.
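To make "cyclomatic complexity grading" concrete, here is a minimal sketch of what such a deterministic check might look like for Python source. It is an assumption-laden approximation, not the commenter's actual skill: the names `cyclomatic_complexity` and `grade_source` and the default limit of 10 are made up for illustration, and the count is a rough McCabe-style tally (nested functions are counted into their parent as well).

```python
import ast

# Node types that add a decision point (a rough McCabe-style count).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)


def cyclomatic_complexity(func: ast.AST) -> int:
    """Return 1 + the number of decision points inside one function node."""
    score = 1
    for node in ast.walk(func):
        if isinstance(node, _BRANCH_NODES):
            score += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds two extra paths.
            score += len(node.values) - 1
    return score


def grade_source(source: str, limit: int = 10) -> dict:
    """Map each function name to (score, "ok" or "too complex")."""
    report = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = cyclomatic_complexity(node)
            report[node.name] = (score, "ok" if score <= limit else "too complex")
    return report
```

The useful property is that the grade is the same on every run, so it can be fed back to the model as a hard signal rather than another opinion.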


The problem is that we have an ever growing and large number of constraints, and not following even a single one means the result is sloppy.

I don’t see them fixing this any time soon, and thus a human in the loop is a requirement to use these tools effectively. That is, unless you love your slot machine dopamine rush enough to ignore quality gates and respect for your peers’ time.


I’m in a similar boat. I find that longer sessions will introduce “noise”. I have to be extremely explicit to avoid adding this noise, as it pollutes the future output of the models. Sometimes it’s innocuous, other times it can derail sessions as the 2nd or 3rd passes introduce even more of their own noise.

To me, it seems the models are inherently designed to do this. Creating more verbose output than input, generating plans that introduce things I didn’t ask for, extras, more “defensive” code that makes sense at first but is completely unnecessary in practice… I find it exhausting, but it’s important to pare down the output / plans at each stage and trim the generated stuff that isn’t needed.


I've been reading and writing Rust for a long while now, since before 1.0. I'm capable of critically evaluating Rust code. I'm also a happy Claude Code user, mostly for lightweight uses like generating scaffolding, prototyping, and debugging.

The pure LLM, no human intervention vibe-coded PRs on Bun since the vibe-rewrite to Rust contain the worst coding horrors I've seen in 20 years of programming.

Setting aside the quality of the change itself (I would have done it differently, for sure: it is pretty straightforward to build a safe abstraction out of this type), the utterly pointless "source-text consistency test" added here is easily the worst example of "test repeats implementation" I have seen in my career:

https://github.com/oven-sh/bun/pull/30728/files#diff-863477b...


Write a skill outlining your expectations of the code, put that skill into the pipeline, so that it can be included within your workflow.

Webdev here, but currently I have:

- a skill where I outlined what the architecture of the system should look like, with guards (static analysis, architecture tests, linting) confirming that the code it generates adheres to standards

- a skill that tells it what tests should look like (use generators, write both feature / unit tests)

- a skill that tells it to generate docs from the code in the form of acceptance criteria (Given / When / Then)

- a skill that tells it to generate frontend uat tests + accompanying backend seeders given the AC

- a skill that tells it to verify that ticket objectives match what was delivered

At this point I still need to guide it to move a task from one stage to the next (coding, testing, verification that what was coded indeed adheres to what was required), but I believe that these dynamic workflows can automate this work as well.


If you want hard rules, use deterministic tools. Prompts are for fuzzy guidance.

How would you prevent a junior engineer from making this mistake? Presumably, you would set up a lint rule. Do the same for LLMs. Run the linter after each edit through a hook, and give the feedback to the LLM. Write your lint rules with clear explanations of why the behavior is a problem, and nudges toward the good behavior.
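A rough sketch of that idea, assuming the forbidden pattern is a call to `typing.cast` in Python code (a stand-in for whatever rule you actually care about). The `Finding` class, the message text, and both function names are hypothetical, and how a hook hands the output back to the agent depends on your harness, so that part is left out.

```python
import ast
from dataclasses import dataclass


@dataclass
class Finding:
    line: int
    message: str


# Hypothetical rule text: says why the pattern is a problem and what to do instead.
NO_CAST_MESSAGE = (
    "typing.cast hides a type mismatch instead of fixing it. "
    "Narrow the type with isinstance, or split the type so no cast is needed."
)


def check_no_casts(source: str) -> list:
    """Flag every call that looks like cast(...) or <something>.cast(...)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Bare name (cast) or attribute access (typing.cast); anything else is None.
        name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
        if name == "cast":
            findings.append(Finding(node.lineno, NO_CAST_MESSAGE))
    return sorted(findings, key=lambda f: f.line)


def format_feedback(path: str, findings: list) -> str:
    """Render findings as plain text that a hook could return to the agent."""
    return "\n".join(f"{path}:{f.line}: {f.message}" for f in findings)
```

The check is deliberately crude (any method named `cast` trips it), but the message carries both the reason and the nudge, which is the part a bare "error: cast" would miss.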

You wouldn't prevent the junior from making this mistake.

You would correct them once or twice, and they won't make the mistake again.

It's something we can't do with LLMs currently, so we all just try (and fail) to predict every possible failure, and then somehow try to cram it all into the limited context.


> Currently a lot of people do this sequentially in the form of multiple code-review passes by fresh agent sessions looking at the work of previous sessions.

Up until now I've used a review loop approach, where within a Claude Code session I just tell it to spawn three review sub-agents, each with context of what's going on and instructions to look over all of the changed code in search of serious/critical issues, but otherwise with a fresher look at things. It works really well for the most part (token usage aside): https://news.ycombinator.com/item?id=48277011


This reminds me of something I do manually all the time (tell Claude to do something, ask another instance what the first one did wrong, etc), and it has given me good results. If the harness can now automate that, it seems great!

So, theoretically, in order to get Claude to perform properly you just have to have a perfect spec up front that anticipates all issues and contains no ambiguity?

Ground truth is not consensus, it has to be graded against what actually works for the original goal. Plenty of scenarios with AI and Humans can result in consensus around incorrectness.

While pedantically correct, I think the comment above assumed that you've correctly specified the work. If you can't correctly specify your work, AI agents are just going to help you get a non-solution faster.

Isn't coding the act of specifying the work to a processor? And aren't AI agents supposed to bridge the gap with intelligence, from less specified to more specified, or possibly even to more intelligent, alternate implementations?

What I meant by "ground truth" is that it is not fuzzy, not AI-evaluated, and not a consensus. The test suite passes or it doesn't. The codebase lints or it doesn't. The performance improved or it didn't.

An agent can help you create the specification, but it's up to you to know whether it's correctly testing that you got the result you wanted.
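As a minimal sketch of a non-fuzzy gate in that sense: each check is just a command, and it passes only if the process exits with status 0. The function names `run_gate` and `gate_passed` are illustrative assumptions, and the demo commands are placeholders standing in for a real test runner or linter.

```python
import subprocess
import sys


def run_gate(checks: dict) -> dict:
    """Run each command; a check passes only if it exits with status 0."""
    results = {}
    for name, argv in checks.items():
        try:
            completed = subprocess.run(argv, capture_output=True, text=True)
            results[name] = completed.returncode == 0
        except FileNotFoundError:
            # A missing tool counts as a failed check, not a skipped one.
            results[name] = False
    return results


def gate_passed(results: dict) -> bool:
    """The gate is binary: every check passed, or the gate failed."""
    return all(results.values())


if __name__ == "__main__":
    # Placeholder commands; swap in your own test suite and linter invocations.
    demo = {
        "always_passes": [sys.executable, "-c", "raise SystemExit(0)"],
        "always_fails": [sys.executable, "-c", "raise SystemExit(1)"],
    }
    outcome = run_gate(demo)
    print(outcome, "gate passed:", gate_passed(outcome))
```

Nothing here judges whether the checks themselves test the right thing; that part stays with you.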


Yep. And yet, there's still some level of specification you have to do.

Doesn't help if the wrong design is implemented correctly.

Yes, that, accuracy, speed, and single computer-use.

I find those to be the limiting factors to speed.

I have extensive rules, I do extensive planning. Yet at implementation, the rules are not respected, errors are introduced, etc...

I spend more time fixing than writing code.

Then speed... Because of the fixes and bad code quality, even with frontier models, speed makes a very big difference. I (agents) spend hours daily doing reviews and fixes. A 5x speed boost would make me much more productive.

And when working super fast with agents, having only one computer is limiting. Even worktrees don't solve the problem, because I use things like convex, chrome use, etc., and they conflict with each other all the time.

Still many problems to solve. It's already evolved so much in the last two years.


This is my experience. Quantity of output is not the issue right now. Quality is. But I’m not sure if this will ever be solved for, given LLMs are non-deterministic sophisticated autocomplete at their core.

Sure, ‘human in the loop’ and all that jazz, but I feel like my knowledge suffers even with this approach. I have to use LLMs with pinpoint focus to get decent results.

The original Copilot completions behavior might be peak LLM performance for coding, aside from having an agent write boilerplate and such.


I have heard of "token-maxxing" but I have not heard of "correctness-maxxing" or "quality-maxxing".

Not with those exact terms, but it is certainly being discussed. Wes McKinney said in a recent talk that with current coding agents there’s no longer an excuse for shipping suboptimal code that takes on tech debt. Writing tests has never been cheaper, writing custom fuzzers, linters, and other harnesses that serve as guardrails has never been cheaper. His take is that “we didn’t have enough engineering time to do it right” is no longer an excuse, and the only excuses left are that you don’t know any better or you have bad taste.

A more interactive Claude code would be great instead of 50 “here’s a tiny snapshot of a change shorn of the context you need to understand it. Yes or no?”s

When this is all finished and done, these coding models will allow you to rewrite the Linux kernel in Rust, recode Kubernetes in assembly, and create your own web framework in 10 min.

But each prompt will cost your company 10 to 15 million dollars. An extra 20 million if you ask them to review the code and improve the comments.


I think for now it's better to convert tokens into code/library code and then work with that for deterministic results rather than relying on Claude being correct or not.

The answer for me has actually been more tokens, and creating even more layers of automated verification.

Yes, I agree with this. More granular going back, letting me interrupt where it went off the rails, or even editing file reads myself, etc., would be lovely. Ingesting parts of other conversations would also be cool!

Dynamic workflows, in my experience, make Claude more effective at complex long-running tasks. They help precisely with getting Claude to do the task correctly.

It feels more like a bespoke build system for the specific task/project than prompting a freeform chat.


As long as agents are fuzzy (which they will continue to be with the Transformers architecture), the need to validate will continue to exist. I cannot imagine merging code without at least 1 human review.

I've used agents quite a bit and I agree.

The current baseline workflow is something like agent output -> human review -> agent refinement -> human review -> agent refinement -> ...

But agents are capable of making meaningful improvements to their own output. I'm hoping dynamic workflows move towards something like:

agent output -> agent review -> agent refinement -> (cycle to fixed point) -> final human review
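For what it's worth, the "cycle to fixed point" step can be sketched as a plain loop. This is only a shape, under the assumption that `review` and `refine` are callables you supply (in practice, wrappers around agent calls); `refine_to_fixed_point` and `max_rounds` are made-up names, and nothing here reflects how any particular harness implements it.

```python
from typing import Callable, List, Tuple


def refine_to_fixed_point(
    draft: str,
    review: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], str],
    max_rounds: int = 5,
) -> Tuple[str, int, bool]:
    """Loop review -> refine until no issues remain or the draft stops changing.

    Returns (final_draft, rounds_used, converged).
    """
    for round_no in range(max_rounds):
        issues = review(draft)
        if not issues:
            return draft, round_no, True
        revised = refine(draft, issues)
        if revised == draft:
            # Refinement changed nothing: a fixed point, but with issues left.
            return draft, round_no + 1, False
        draft = revised
    # Out of budget: report whether the last draft happens to be clean.
    return draft, max_rounds, not review(draft)


if __name__ == "__main__":
    # Toy stand-ins for agents: flag leftover TODO markers, remove one per pass.
    def toy_review(text: str) -> List[str]:
        return ["leftover TODO"] if "TODO" in text else []

    def toy_refine(text: str, issues: List[str]) -> str:
        return text.replace("TODO", "", 1)

    print(refine_to_fixed_point("aTODObTODO", toy_review, toy_refine))
```

The round cap and the "converged" flag matter: a loop that stops changing is not the same as a loop that ended up correct, so the final human review still gets told which case it is looking at.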



