Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

📰 ArXiv cs.AI

Learn how Large Language Models handle variations in math questions and whether code execution methods improve reasoning robustness

Advanced · Published 27 May 2026

Action Steps

Analyze the performance of LLMs on mathematical reasoning benchmarks
Modify math problems with simple changes like different names or numbers to test reasoning robustness
Implement code execution methods to generate and run Python code for math problems
Compare the performance of LLMs with and without code execution methods on modified math problems
Evaluate the effect of code execution methods on reasoning robustness across problem variations
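
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's method: the problem template, the name list, and both solver functions are hypothetical stand-ins (no real LLM is called), and the numbers they produce say nothing about actual model behavior. The point is the shape of the experiment: generate variants with new names and numbers, run model-written code to get an answer, and compare accuracy across solvers.

```python
import random
import re

# Hypothetical problem template: the name and both numbers are the slots we vary.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")
NAMES = ["Ava", "Noah", "Mei", "Omar"]  # hypothetical replacement names


def make_variant(rng):
    """Return (question, ground_truth) with a fresh name and fresh numbers."""
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


def run_generated_code(code):
    """Execute model-written Python and return its `answer` variable.

    Emptying __builtins__ is NOT a real sandbox; run untrusted code in an
    isolated process or container in practice.
    """
    namespace = {}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace.get("answer")


def stub_code_solver(question):
    """Stand-in for an LLM that writes code: emits `answer = a + b`."""
    a, b = (int(n) for n in re.findall(r"\d+", question))
    return run_generated_code(f"answer = {a} + {b}")


def stub_memorizing_solver(question):
    """Deliberately brittle toy: always returns one memorized answer."""
    return 12


def accuracy(solver, variants):
    """Fraction of variants on which the solver matches the ground truth."""
    correct = sum(solver(q) == truth for q, truth in variants)
    return correct / len(variants)


rng = random.Random(0)  # fixed seed so the variant set is reproducible
variants = [make_variant(rng) for _ in range(20)]
code_accuracy = accuracy(stub_code_solver, variants)
text_accuracy = accuracy(stub_memorizing_solver, variants)
print("code execution:", code_accuracy)
print("memorized answer:", text_accuracy)
```

To use this with a real model, replace the two stub solvers with functions that call the model (one prompting for a natural-language answer, one prompting for Python that assigns to `answer`) and keep the variant generator and accuracy comparison unchanged.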

Who Needs to Know This

Researchers and developers working on LLMs and math reasoning benchmarks benefit from understanding the strengths and limitations of current models and the potential of code execution methods to improve robustness.

Key Insight

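💡 Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a way to improve the reasoning robustness of Large Language Models on mathematical reasoning benchmarks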

Full Article

Abstract:
arXiv:2605.26414v1

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) h
