It seems to me that even if AI technology were to freeze right now, one of the next moderately-sized advances in AI would come from better filtering of the input data. Remove the training data in which humanity teaches the AI to play games like this, and the AI would be much less likely to play them.
I very carefully say "much less likely" and not "impossible" because, given how these models work, they'll still pick up subtle signals for these things anyhow. But, frankly, what do we expect from shoving what is probably more-or-less the whole of Reddit into the models? Yes, it has a lot of good data, but it also has rather a lot of behavior I'd like to cut out of my AI.
I hope someone out there is playing with using LLMs to vector-classify their input data, identifying things like the "passive-aggressive" region of the resulting vector space and trying to remove it from the input data entirely.
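A minimal sketch of what that might look like, using the sentence-transformers library: estimate a "passive-aggressive" direction in embedding space from a tiny seed set, then drop documents that project too strongly onto it. The seed examples, model name, and threshold here are all illustrative assumptions, not anyone's production pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Hypothetical seed sets; in practice these would be human-curated.
passive_aggressive = [
    "Well, I guess some people just don't read the docs.",
    "Fine, do it your way. Don't come crying to me later.",
]
neutral = [
    "The function returns None when the key is missing.",
    "You can configure the timeout in settings.py.",
]

# Estimate the "passive-aggressive" direction as the difference of class means.
direction = model.encode(passive_aggressive).mean(axis=0) - model.encode(neutral).mean(axis=0)
direction /= np.linalg.norm(direction)

def keep(text: str, threshold: float = 0.25) -> bool:
    """Keep a document if its cosine similarity to the direction is below threshold."""
    vec = model.encode([text])[0]
    vec = vec / np.linalg.norm(vec)
    return float(vec @ direction) < threshold

corpus = [
    "Oh sure, because THAT always works out great.",
    "Reverse the list, then take the first three elements.",
]
filtered = [doc for doc in corpus if keep(doc)]
```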
I think part of the problem is that you need a model to classify the data, and that model has to be trained either on data that was never classified or on a dramatically smaller set of human-classified data, so it's effectively impossible to escape this sort of input bias.
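To make the bootstrap concrete, here's a hedged sketch of the second option: a tiny human-labeled set trains a cheap classifier over embeddings, which then scores the full corpus. Again, the labels, model, and cutoff are assumptions for illustration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Hypothetical human-labeled seed set: 1 = passive-aggressive, 0 = fine.
seed_texts = [
    "No offense, but maybe leave this to people who know what they're doing.",
    "Happy to help; the relevant section of the manual is chapter 4.",
    "Wow, bold choice. I'm sure it'll work out for you.",
    "This fails on empty input; adding a guard clause fixes it.",
]
seed_labels = [1, 0, 1, 0]

clf = LogisticRegression().fit(model.encode(seed_texts), seed_labels)

def filter_corpus(corpus: list[str], max_prob: float = 0.5) -> list[str]:
    """Drop documents the seed-trained classifier flags as passive-aggressive."""
    probs = clf.predict_proba(model.encode(corpus))[:, 1]
    return [doc for doc, p in zip(corpus, probs) if p < max_prob]
```

Note the circularity this makes visible: the embedding model itself was trained on the unfiltered corpus, so its notion of "passive-aggressive" already carries the bias the filter is supposed to remove.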
Tangentially, I'd be far from the first to point out that these LLMs are now polluting their own training data, which makes filtering simultaneously all the more important and all the more impossible.