MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation
📰 ArXiv cs.AI
arXiv:2601.21225v2 Announce Type: replace-cross Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, that evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extensio