There are no real benchmarks of how "natural/idiomatic" output is in a multitude of languages.
"Multilingual benchmarks" are usually something like "How good is it at a multiple choice exam like the SAT in language X". This is a completely unrelated metric.
"Multilingual benchmarks" are usually something like "How good is it at a multiple choice exam like the SAT in language X". This is a completely unrelated metric.