Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling

📰 ArXiv cs.AI

arXiv:2604.28075v1 (cross-listed)

Abstract: Recent research has shown that filtering massive English web corpora into high-quality subsets significantly improves training efficiency. However, for high-resource non-English languages like German, French, or Japanese, aggressive filtering creates a strategic dilemma: should practitioners prioritize diversity by training once on large amounts of lightly filtered web data, or prioritize quality by strictly filtering for a high-quality core and

Published 1 May 2026