Data Augmentations for Data-Constrained Language Model Pretraining
📰 ArXiv cs.AI
arXiv:2606.16246v1 (Announce Type: cross)
Abstract: As AI labs approach a data ceiling where compute capacity outpaces the rate of new high-quality text generation, language model pretraining is shifting toward a data-constrained, compute-abundant regime that demands productive multi-epoch training on fixed corpora. Standard autoregressive (AR) pretraining overfits severely in this setting, reaching its optimum early and then continuously deteriorating. We investigate data augmentation as a regula