Data Value in the Age of Scaling: Understanding LLM Scaling Dynamics Under Real–Synthetic Data Mixtures

Haohui Wang; Jingyuan Qi; Jianpeng Chen; Jun Wu; Lifu Huang; Lecheng Zheng; Kevin Choi; Balaji Veeramani; Edward Bowen; Alison Hu; Tyler Cody; Dawei Zhou

18 Sept 2025 (modified: 19 Nov 2025); ICLR 2026 Conference Withdrawn Submission; CC BY 4.0

Keywords: Data Valuation, LLM, Scaling Dynamics

Abstract: The rapid progress of large language models (LLMs) is fueled by a growing reliance on datasets that blend real and synthetic data. While synthetic data offers scalability and cost-efficiency, it often introduces systematic distributional discrepancies; in particular, it underrepresents long-tail knowledge because of truncation effects from data generation mechanisms such as top-$p$ sampling, temperature scaling, and finite sampling. These discrepancies pose fundamental challenges in characterizing and evaluating the utility of mixed real–synthetic datasets. In this paper, we identify a three-phase scaling behavior characterized by two breakpoints that reflect transitions in model behavior as it learns head and tail knowledge. We further derive an LLM generalization bound designed for real–synthetic mixtures, revealing several key factors that govern their generalization performance. Building on our theoretical findings, we propose an effective and efficient data valuation method that scales to large datasets. Comprehensive experiments across four tasks (image classification, sentiment classification, instruction following, and complex reasoning) demonstrate that our method surpasses state-of-the-art baselines in data valuation at significantly lower computational cost.
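The truncation effect mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes a toy Zipf-shaped token distribution over a hypothetical 1000-token vocabulary, and the helper names `zipf_probs` and `top_p_truncate` are introduced here purely for illustration. It applies standard top-$p$ (nucleus) truncation and counts how many tail tokens end up with zero probability.

```python
def zipf_probs(vocab_size, exponent=1.0):
    """Toy long-tailed distribution: probability proportional to 1 / rank**exponent."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def top_p_truncate(probs, p):
    """Keep the smallest set of most-probable tokens whose cumulative mass
    reaches p, zero out the rest, and renormalize the kept tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    truncated = [0.0] * len(probs)
    for i in kept:
        truncated[i] = probs[i] / cumulative
    return truncated


# Assumed toy setting: 1000 tokens, top-p threshold 0.9.
probs = zipf_probs(1000)
truncated = top_p_truncate(probs, 0.9)

# Tokens that can never be sampled after truncation (the lost tail).
dropped = sum(1 for q in truncated if q == 0.0)
print(f"tokens with zero probability after top-p: {dropped} of {len(probs)}")
```

In this toy setting the head tokens gain probability mass after renormalization while the rarest tokens can no longer be generated at all, which is the kind of long-tail underrepresentation the abstract attributes to synthetic data.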

Primary Area: interpretability and explainable AI

Submission Number: 13262
