Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora

📰 ArXiv cs.AI

arXiv:2604.24819v1 (announce type: cross)

Abstract: Reliably transferring specialized human knowledge from text into large language models remains a fundamental challenge in artificial intelligence. Fine-tuning on domain corpora has enabled substantial capability gains, but the process operates without feedback: when a model fails on a domain task, there is no method to diagnose what is deficient in the training data, and the only recourse is to add more data indiscriminately. Here we show that wh

Published 29 Apr 2026