Regarding Anthropic: they used to make the best multilingual and generalist models. It's a policy thing, not a capability issue. Claude 3 was the best at this, including dead and low-resource languages. Neither modern Claude nor Gemini is remotely close to what Claude 3 was capable of (e.g. zero-shot writing styles). Anthropic basically reversed their "character training" policy and started optimizing their models for code generation at the cost of everything else, starting with Sonnet 3.5. Claude 4 took a huge hit in multilingual ability.
GPT, on the other hand, was always terrible at languages, except for the short-lived gpt-4.5-preview.
All modern models including Gemini have bugs in basic language coherency - random language switching, self-correction attempts resulting in hallucinations etc. I speculate it's a problem with heavy RL with rewards and policies not optimized for creative writing.
I've never once had Gemini switch languages over the API in translation tasks, and that's across more than 10 language pairs and a six-figure number of calls, with both short and long outputs. Maybe your languages are even lower-resource ones, though we do include Central Asian languages.
The Chinese models are very prone to it; they love to mix languages up.
I've seen it in chat, but IMO that's more of a system prompt/harness issue.
I'll admit I don't remember Claude 3; the oldest data I have seems to be from 3.5. At that time Gemini 1.5 Pro did a much better job across all of our language pairs, and it wasn't close.
This always bothers me, because models will almost never see text that is mostly English with a little of another language in training data (the opposite happens, of course), and certainly not in RL data. Why do they occasionally switch languages?
That's the thing with benchmarks: without evals and actual hands-on experience they can give you false confidence. Claude now sounds almost clinical and can't switch between styles as easily. Claude 4+ uses a lot more constructions borrowed from English than Claude 3 did, especially in Slavic languages, where they sound unnatural. And most modern models eventually glitch out in longer texts, spitting out a few garbage tokens in a random, totally unrelated language (Telugu, Georgian, Ukrainian), then continuing in the main language like nothing happened. It's rare but it happens. Samplers do not help with this; you need a second run to spellcheck it. This wasn't a problem in older models. It's a widespread issue that roughly correlates with the introduction of reasoning. Another new failure mode is self-correction in complicated texts that need reading comprehension: if the model hallucinates an incorrect fact and spots it, it immediately tries to justify or explain it, which is much more awkward than leaving it incorrect. Those hallucinations are also more common now (maybe because the model learns to make those mistakes together with the correction? I don't know).
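For what it's worth, the "second run to spellcheck" step can start with something much cheaper than another model call. Below is a minimal sketch that flags words containing letters from an unexpected script. The function names (`script_of`, `find_foreign_words`) are made up for illustration, and using the first word of the Unicode character name as a script label is a rough heuristic, not a proper script lookup.

```python
import unicodedata

def script_of(ch):
    """Rough script label: first word of the Unicode name (e.g. 'CYRILLIC').

    Returns None for non-letters (digits, punctuation, combining marks).
    """
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:  # character has no Unicode name
        return None

def find_foreign_words(text, expected=("LATIN",)):
    """Return (word, [scripts]) for each whitespace-separated word that
    contains letters outside the expected scripts."""
    allowed = set(expected)
    hits = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        foreign = scripts - allowed
        if foreign:
            hits.append((word, sorted(foreign)))
    return hits

# Russian text with a stray Georgian word in the middle.
sample = "Это обычный русский текст გამარჯობა и он продолжается."
print(find_foreign_words(sample, expected=("CYRILLIC",)))
```

This only catches script changes, so it would flag Telugu or Georgian inside Russian text but not stray Ukrainian, which shares the Cyrillic script. That case still needs a language-ID model or an actual second pass.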
Btw samplers do in fact help with this. Random tokens deep in your output context are due to accumulated sampling errors from using shit samplers like top_p and top_k with temperature.
Use a full-distribution-aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away.
Yes, the paper for this will be up for review at NeurIPS this year.
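To make the sampler point concrete, here is a rough sketch of top-n sigma as I understand it: keep only the tokens whose logit is within n standard deviations of the maximum logit, then sample from what is left. This is a pure-Python illustration on a plain list of logits, not any inference engine's actual implementation; the function names and the default `n=1.0` are assumptions.

```python
import math
import random

def top_n_sigma_filter(logits, n=1.0):
    """Indices of tokens whose logit is >= max(logits) - n * std(logits)."""
    mean = sum(logits) / len(logits)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in logits) / len(logits))
    threshold = max(logits) - n * sigma
    return [i for i, x in enumerate(logits) if x >= threshold]

def sample_top_n_sigma(logits, n=1.0, temperature=1.0, rng=random):
    """Sample one token index from the filtered set using softmax weights."""
    kept = top_n_sigma_filter(logits, n)
    peak = max(logits[i] for i in kept)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp((logits[i] - peak) / temperature) for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]

# Two plausible tokens and a long tail of junk.
logits = [10.0, 9.5, 2.0, 1.0, 0.5, 0.0, -1.0]
print(top_n_sigma_filter(logits))            # only the two head tokens survive
print(sample_top_n_sigma(logits, temperature=1.5))
```

One property worth noting: dividing all logits by a temperature scales both the gap to the maximum and the standard deviation by the same factor, so the kept set does not change with temperature. With top_p or top_k, raising the temperature pushes more probability mass into the tail that can still be sampled.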
Basically everyone who experiments with creative writing is keenly aware of that (e.g. roleplayers). It's just that the devs who have the experience training models for it (Anthropic, DeepMind) can't be bothered to do it anymore, since there's no money in it.
>this seems like something that should be capturable in a multi-lingual benchmark
Creative writing benchmarks just don't have good objectives to measure against. In particular, valid but inauthentic language constructions can't be captured if your LLM judge lacks the fidelity to notice them in the first place, which I think is what typically happens.
An easy litmus test would be making a selected character in a story speak Ebonics or Haitian Creole or TikTok. Claude 3 Opus was light years ahead of any other model in using them authentically, and it was immediately obvious in a side-by-side comparison with any model, including Claude 3.5+. The nuances of Polish or Russian profanity (mat) or British obscenities are always the hardest for any model (they tend to either swear like dockers or tone it down, lacking the eloquence), but Opus 3 was ahead in all of those too.
There are no real benchmarks of how "natural/idiomatic" output is in a multitude of languages.
"Multilingual benchmarks" are usually something like "How good is it at a multiple choice exam like the SAT in language X". This is a completely unrelated metric.