Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

https://www.lakera.ai/blog/claude-4-sonnet-a-new-standard-fo...

These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.



They gave a bullet point in that intro which I disagree with: "The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections."

I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.

I'm hoping someone implements a version of the CaMeL paper - that solution seems much more credible to me. https://simonwillison.net/2025/Apr/11/camel/


I only half understand CaMeL. Couldn't the prompt injection just happen at the stage where the P-LLM devises the plan for the other LLM such that it creates a different, malicious plan?

Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?

My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.

Overall it's very good to see research in this area though (also seems very interesting and fun).


The idea is that the P-LLM is never exposed to interested data.


Agreed on CaMeL as a promising direction forward. Guardrails may not get 100% of the way but are key for defense in depth, even approached like CaMeL currently fall short for text to text attacks, or more e2e agentic systems.


What security measure, in any domain, is 100% effective?


Using parameters in your SQL query in place of string concatenation to avoid SQL injection.

Correctly escaping untrusted markup in your HTML to avoid XSS attacks.

Both of those are 100% effective... unless you make a mistake in applying those fixes.

That is why prompt injection is different: we do not know what the 100% reliable fixes for it are.


Fair point - "the only way to" is probably too strong a framing. But I think the core argument stands: while model-level safety improvements are valuable, they're not sufficient for securing real applications. Claude is clearly the safest model available right now, but it's still highly susceptible to indirect prompt injection attacks and remains practically unaligned when it comes to tool use. The safety work at the model level helps with direct adversarial prompts, but doesn't solve the fundamental architectural vulnerabilities that emerge when you connect these models to external data sources and tools - for now.


None; but, as mentioned in the post, 99% is considered a failing grade in application security.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: