It's actually a reasonable way to think about alignment. Sometimes you want the ...

It's actually a reasonable way to think about alignment. Sometimes you want the agent to just listen to you and sometimes you want the agent to think critically.

I think about this line a lot. For example, as it happens sometimes you'll have a typo in something you want the agent to do. Llms typically will correct that typo silently and implement the actually intended thing. But if you said, "no, I want the thing I typed," I think everyone's expectation is that is says, "ok done."

I've found that leaving clues in the system prompt / exchange that are open to critique largely mitigate sycophancy with most recent models.

As engineers were trained to represent our positions strongly. Strong opinions loosely held, etc. when you speak authoritatively to a person, "I think we should do x...", the person understand that that's just you're opinion and have the autonomy to push back.

An llm imo _shouldnt_ have that same kind of autonomy by default and it should be rlhf'ed out.