Detecting Data Contamination in Large Language Models

📰 ArXiv cs.AI

arXiv:2604.19561v1 Abstract: Large Language Models (LLMs) are trained on large amounts of data, some of which may come from copyrighted sources. Membership Inference Attacks (MIAs) aim to detect whether such documents were included in the training corpora of LLMs. Black-box MIAs require a significant amount of data manipulation; therefore, comparing them is often challenging. We study state-of-the-art (SOTA) MIAs under the black-box assumpti

Published 22 Apr 2026
