Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

📰 ArXiv cs.AI

arXiv:2512.24503v2

Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training runs. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to full-scale model training. In this work, we uncover a subtle yet critical issue in the standard experimental protocol for data recipe assessment

Published 14 Apr 2026
