Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

📰 arXiv cs.AI

arXiv:2604.10480v1. Abstract: Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Throu

Published 14 Apr 2026
