Automatically Generating Hard Math Problems from Hypothesis-Driven Error Analysis

📰 ArXiv cs.AI

arXiv:2604.04386v1 · Announce Type: new

Abstract: Numerous math benchmarks exist to evaluate LLMs' mathematical capabilities. However, most involve extensive manual effort and are difficult to scale. Consequently, they cannot keep pace with LLM development or easily provide new instances to mitigate overfitting. Some researchers have proposed automatic benchmark generation methods, but few focus on identifying the specific math concepts and skills on which LLMs are error-prone, and most can only g…
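The abstract sketches a pipeline: probe a model on concept-tagged seed problems, tally the concepts where it fails, and generate new problems that target those weaknesses. Below is a minimal sketch of what such a hypothesis-driven error-analysis loop might look like; the `query_llm` stub, the toy tagged seed set, and the caller-supplied `grade` function are all illustrative assumptions, not the paper's actual method or interface.

```python
from collections import Counter
from typing import Callable

def query_llm(prompt: str) -> str:
    """Hypothetical stub for an LLM call; swap in a real API client."""
    raise NotImplementedError

# Toy seed set: (problem text, reference answer, concept tags).
SEED_PROBLEMS = [
    ("Compute gcd(84, 126).", "42", ["number_theory.gcd"]),
    ("Solve x^2 - 5x + 6 = 0.", "x = 2 or x = 3", ["algebra.quadratics"]),
]

def error_prone_concepts(grade: Callable[[str, str], bool],
                         top_k: int = 5) -> list[str]:
    """Tally which concept tags co-occur with wrong answers."""
    errors: Counter[str] = Counter()
    for problem, answer, tags in SEED_PROBLEMS:
        if not grade(query_llm(problem), answer):
            errors.update(tags)
    return [concept for concept, _ in errors.most_common(top_k)]

def generate_hard_problems(concepts: list[str],
                           n_per_concept: int = 3) -> list[str]:
    """Ask a generator model for fresh problems aimed at weak concepts."""
    return [
        query_llm(
            f"Write a challenging math problem testing the concept "
            f"'{concept}', with a verifiable numeric answer."
        )
        for concept in concepts
        for _ in range(n_per_concept)
    ]
```

In practice the grading step is the hard part; an exact-match `grade` undercounts correct answers phrased differently, so a real pipeline would need a verifier model or symbolic checker at that point.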

Published 7 Apr 2026