LLMSurgeon: Diagnosing Data Mixture of Large Language Models

ACL ARR 2026 January Submission594 Authors

23 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Pre-training Data Composition, Data Transparency, Model Auditing, LLM
Abstract: The pretraining data mixture of Large Language Models (LLMs) acts as their "digital DNA", fundamentally governing model behaviors. While existing Membership Inference Attacks (MIA) can detect whether individual training samples are included in the pretraining dataset, they fail to quantify the macroscopic distribution of domains. In this work, we formalize the Pretraining Data Mixture Surgery (DMS) problem: inferring the domain-level distribution of an LLM's training corpus solely from its generated texts. We propose LLMSurgeon, a principled framework that models DMS as an inverse problem under the label-shift assumption. Unlike naive classification, LLMSurgeon utilizes a calibrated "soft" confusion matrix to rectify systematic classifier biases, accurately recovering the latent training prior. To rigorously evaluate this task, we introduce LLMScan, the first benchmark comprising open-source LLMs with transparent data recipes. Experiments demonstrate that LLMSurgeon significantly outperforms aggregation-based MIA baselines, recovering ground-truth mixtures with high fidelity. Our work offers a practical, post-hoc method for auditing the "digital DNA" of foundation models without accessing their training data.
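The confusion-matrix correction the abstract alludes to can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the standard label-shift identity q = Cᵀπ, where C is a row-stochastic soft confusion matrix estimated on held-out labeled text, q is the average predicted domain distribution over the LLM's generations, and π is the latent training prior to recover. All names here are hypothetical.

```python
import numpy as np

def estimate_mixture(confusion, pred_dist):
    """Recover the latent domain prior pi from the average predicted
    distribution q under the label-shift assumption q = C^T pi, where
    confusion[i, j] is the probability that a text from domain i is
    (softly) classified as domain j. Solves a least-squares inverse
    problem, then projects the result back onto the simplex."""
    pi, *_ = np.linalg.lstsq(confusion.T, pred_dist, rcond=None)
    pi = np.clip(pi, 0.0, None)   # mixture weights cannot be negative
    return pi / pi.sum()          # renormalize to a valid distribution

# Synthetic check: a 3-domain mixture seen through a biased classifier.
true_pi = np.array([0.5, 0.3, 0.2])
C = np.array([[0.80, 0.10, 0.10],
              [0.05, 0.90, 0.05],
              [0.10, 0.10, 0.80]])  # row-stochastic soft confusion matrix
q = C.T @ true_pi                   # observed average predicted distribution
est = estimate_mixture(C, q)
```

Naively averaging the classifier's predictions would report q itself, which is systematically biased toward easily-confused domains; inverting through C undoes that bias, which is the core idea the abstract describes.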
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Pre-training Data Composition, Ethics Bias and Fairness, Data Transparency
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 594