Regarding Anthropic, they used to make the best multilingual and generalist models, ...

deaux · 2026-04-23T22:56:30 1776984990

I've never ever had Gemini over the API switch languages in translation tasks and that's across more than 10 language pairs and 6 figures of calls, across both short and long outputs. Maybe your languages are even lower resource ones, though we do include Central Asian languages.

The Chinese models are very prone to it, they love to mix them up.

I've seen it in chat, but IMO that's more of a system prompt/harness issue.

I'll admit I don't remember Claude 3, the oldest data I have seems to be 3.5. And at that time Gemini 1.5 Pro did a much better job across all of our language pairs, it wasn't close.

rao-v · 2026-04-23T02:38:22 1776911902

This always bothers me because models will almost never see text that is mostly English with a little of another language in training data (the opposite happens, of course), and certainly not in RL data. Why do they occasionally language switch?

awongh · 2026-04-22T20:46:03 1776890763

The benchmarks don’t seem to say that language ability has gotten worse?

orbital-decay · 2026-04-22T21:25:30 1776893130

That's the thing with benchmarks: without evals and actual hands-on experience they can give you false confidence. Claude now sounds almost clinical and can't switch between styles as easily. Claude 4+ uses a lot more constructions borrowed from English than Claude 3, especially in Slavic languages, where they sound unnatural. And most modern models eventually glitch out in longer texts, spitting out a few garbage tokens in a random language (Telugu, Georgian, Ukrainian, totally unrelated), then continuing in the main language like nothing happened. It's rare but it happens. Samplers do not help with this; you need a second run to spellcheck it. This wasn't a problem in older models; it's a widespread issue that roughly correlates with the introduction of reasoning. Another new failure mode is self-correction in complicated texts that need reading comprehension: if the model hallucinates an incorrect fact and spots it, it tries to justify or explain it immediately, which is much more awkward than leaving it incorrect. Those hallucinations are also more common now (maybe because the model learns to make those mistakes together with the correction? I don't know).
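For the stray-script case (a few Telugu or Georgian tokens inside otherwise Latin or Cyrillic text), the "second run" can be partly automated without another model call. Below is a minimal sketch, not anyone's production check: `script_of` and `find_stray_script_runs` are hypothetical helper names, the script label is approximated by the first word of the Unicode character name, and the `min_share` threshold of 0.1 is an arbitrary assumption. It will not catch Ukrainian tokens inside Russian text, since both are Cyrillic.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script label (e.g. 'LATIN', 'CYRILLIC') for letters and marks, else None.

    Assumption: the first word of the Unicode character name is a good
    enough proxy for the script. This is a heuristic, not the official
    Unicode Script property.
    """
    if unicodedata.category(ch)[0] not in "LM":
        return None  # digits, punctuation, whitespace, symbols
    try:
        label = unicodedata.name(ch).split()[0]
    except ValueError:
        return None  # unnamed code point
    # Generic combining diacritics do not identify a script.
    return None if label == "COMBINING" else label


def find_stray_script_runs(text, min_share=0.1):
    """Return (start_index, run) pairs for runs in scripts that are rare in `text`.

    A script counts as expected if it accounts for at least `min_share`
    of all letters; anything else is reported as a stray run.
    """
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return []
    expected = {s for s, c in counts.items() if c / total >= min_share}

    runs, start = [], None
    for i, ch in enumerate(text):
        s = script_of(ch)
        stray = s is not None and s not in expected
        if stray and start is None:
            start = i
        elif not stray and start is not None:
            runs.append((start, text[start:i]))
            start = None
    if start is not None:
        runs.append((start, text[start:]))
    return runs
```

Running this over long generations and re-sampling only the outputs where it returns a non-empty list would be cheaper than a full spellcheck pass, though it only covers the cross-script glitches.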

Der_Einzige · 2026-04-23T03:24:27 1776914667

Btw samplers do in fact help with this. Random tokens deep in your output context are due to accumulated sampling errors from using shit samplers like top_p and top_k with temperature.

Use a full-distribution-aware sampler like p-less decoding, top-H, or top-n sigma, and this goes away.

Yes, the paper for this will be up for review at NeurIPS this year.
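For readers unfamiliar with the samplers named above, here is a rough sketch of the top-n sigma idea as I understand its published description: keep only tokens whose logit lies within n standard deviations of the maximum logit, then renormalize. The function names, the toy logits in the usage note, and the default `n=1.0` are assumptions for illustration; this is not the commenter's implementation, it does not cover p-less decoding or top-H, and it makes no claim about whether the stray-token problem actually goes away.

```python
import math
import random


def top_n_sigma_probs(logits, n=1.0, temperature=1.0):
    """Return {token_id: probability} after top-n sigma filtering.

    Tokens are kept when scaled_logit >= max - n * std, where max and std
    are computed over the full (temperature-scaled) logit vector. Because
    both terms scale with 1/temperature, the kept set does not depend on
    the temperature; only the probabilities within it do.
    """
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    mean = sum(scaled) / len(scaled)
    sigma = math.sqrt(sum((x - mean) ** 2 for x in scaled) / len(scaled))
    threshold = top - n * sigma

    # Softmax restricted to the surviving tokens (shifted by max for stability).
    weights = {i: math.exp(x - top) for i, x in enumerate(scaled) if x >= threshold}
    norm = sum(weights.values())
    return {i: w / norm for i, w in weights.items()}


def sample_token(probs, rng=random):
    """Draw one token id from the filtered distribution."""
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]
```

With toy logits such as `[10.0, 9.5, 2.0, 1.0, 0.5, 0.0, -1.0, -2.0]`, only the first two tokens survive the filter at any temperature, so the low-probability tail that top_p or top_k could still reach at high temperature is never sampled.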

awongh · 2026-04-22T21:45:24 1776894324

Not disputing this might be true, but this seems like something that should be capturable in a multi-lingual benchmark.

Maybe it's just something that people aren't bothered with?

orbital-decay · 2026-04-22T22:29:27 1776896967

Basically everyone who experiments with creative writing is keenly aware of that (e.g. roleplayers). It's just that the devs who have the experience training models for it (Anthropic, DeepMind) can't be bothered to do it anymore, since there's no money in it.

>this seems like something that should be capturable in a multi-lingual benchmark

Creative writing benchmarks just don't have good objectives to measure against. In particular, valid but inauthentic language constructions can't be captured well if your LLM judge lacks the fidelity to notice them to begin with, which I think is what typically happens.

An easy litmus test would be making a selected character in a story speak Ebonics or Haitian Creole or TikTok. Claude 3 Opus was light years ahead of any other model in using them authentically, and it was immediately obvious in a side-by-side comparison with any model, including Claude 3.5+. Nuances of Polish or Russian profanities/mat or British obscenities are always the hardest for any model (they tend to either swear like dockers or tone it down, lacking the eloquence), but Opus 3 was also ahead in all of those.

deaux · 2026-04-23T22:49:37 1776984577

There are no real benchmarks of how "natural/idiomatic" output is in a multitude of languages.

"Multilingual benchmarks" are usually something like "How good is it at a multiple choice exam like the SAT in language X". This is a completely unrelated metric.

awongh · 2026-04-24T14:26:08 1777040768

Then there should be such a benchmark :)