Data Augmentations for Data-Constrained Language Model Pretraining

📰 ArXiv cs.AI

arXiv:2606.16246v1 Announce Type: cross

Abstract: As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regula

Published 16 Jun 2026