These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.
They gave a bullet point in that intro which I disagree with: "The only way to make GenAI applications secure is through vulnerability scanning and guardrail protections."
I still don't see guardrails and scanning as effective ways to prevent malicious attackers. They can't get to 100% effective, at which point a sufficiently motivated attacker is going to find a way through.
I only half understand CaMeL. Couldn't the prompt injection just happen at the stage where the P-LLM devises the plan for the other LLM such that it creates a different, malicious plan?
Or is it more about the user then having to confirm/verify certain actions and what is essentially a "permission system" for what the LLM can do?
My immediate thought is that that may be circumvented in a way where the user unknowingly thinks they are confirming something safe. Analogous to spam websites that show a fake "Allow Notifications" prompt that is rendered as part of the actual website body. If the P-LLM creates the plan it could make it arbitrarily complex and confusing for the user, allowing something malicious to happen.
Overall it's very good to see research in this area though (also seems very interesting and fun).
Agreed on CaMeL as a promising direction forward. Guardrails may not get 100% of the way but are key for defense in depth, even approached like CaMeL currently fall short for text to text attacks, or more e2e agentic systems.
Fair point - "the only way to" is probably too strong a framing. But I think the core argument stands: while model-level safety improvements are valuable, they're not sufficient for securing real applications.
Claude is clearly the safest model available right now, but it's still highly susceptible to indirect prompt injection attacks and remains practically unaligned when it comes to tool use. The safety work at the model level helps with direct adversarial prompts, but doesn't solve the fundamental architectural vulnerabilities that emerge when you connect these models to external data sources and tools - for now.
These LLMs still fall short on a bunch of pretty simple tasks. Attackers can get Claude 4 to deny legitimate requests easily by manipulating third party data sources for example.