Spokes: Optimizing for Diverse Pretraining Data Selection

📰 ArXiv cs.AI

arXiv:2606.15216v1 Announce Type: cross

Abstract: Diversity plays a critical role in data selection, improving performance under fixed data budgets by reducing redundancy and repetition. However, optimizing for diversity is inherently challenging, as it is a set-level property that depends on interactions between data points rather than individual examples. As a result, existing approaches typically rely on proxies or approximations, which often fail to ensure sufficiently diverse subsets. In th

Published 16 Jun 2026