Data filtering methods for training language models

📰 ArXiv cs.AI

arXiv:2605.29807v1. Announce Type: cross. Abstract: Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods, Confident Learning and Dataset Cartography, on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion

Published 29 May 2026
