Spokes: Optimizing for Diverse Pretraining Data Selection
📰 ArXiv cs.AI
arXiv:2606.15216v1 Announce Type: cross
Abstract: Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In th