This can be dangerous, because Claude doesn't truly understand why it did something. Whatever it writes a post-hoc justification which may or may not be accurate to the "intent". This is because these are still autoregressive models --- they have only the context to go on, not prior intent.
Indeed. Watching it (well, Anthropic, really) cheat at Baba Is You and then try to give a rationalization for how it came up with the solution (qv. https://news.ycombinator.com/item?id=44473615) is quite instructive.