Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

📰 ArXiv cs.AI

arXiv:2501.06708v5. Abstract: Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods rely on hand-crafted heuristics or downstream datasets, or require expensive influence-based computations, all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple, geometry-based data-quality metric that evaluates utility by

Published 27 May 2026
