Round-Trip Translation Reveals What Frontier Multilingual Benchmarks Miss

📰 ArXiv cs.AI

arXiv:2604.12911v1 Announce Type: cross Abstract: Multilingual benchmarks guide the development of frontier models. Yet the multilingual evaluations reported for frontier models are structured similarly to popular reasoning and knowledge benchmarks, only spread across many languages. We show that such benchmarks, and consequently multilingual evaluations, measure mathematical reasoning and factual recall, not multilingual proficiency. For example, thinking variants dramatically outperform instruct variants on these

Published 15 Apr 2026
