How Text Quality Interventions Reshape Neural Scaling Laws for LLMs: An Empirical Study

ICLR 2026 Conference Submission 20595 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Neural scaling laws, Text quality
TL;DR: We present the first large-scale empirical study showing how text quality interventions reshape neural scaling laws and compute-optimal strategies for training LLMs, highlighting the need to rank data strategies using scaling law curves.
Abstract: Neural scaling laws are widely used for performance projection and resource planning, yet their sensitivity to data quality interventions remains poorly understood. We present an empirical study of how such interventions (deduplication, heuristic filtering, and LLM-guided rewriting) reshape scaling behavior in large language model training. Using QualityPajama, a suite of 23 systematically filtered and synthetic datasets, we train over 2,000 models (100M–8B parameters, 100M–200B tokens) to measure how data quality affects scaling-law parameters and compute-optimal design decisions. Our results show that data interventions reshape scaling dynamics in non-trivial ways not captured by current theory, simultaneously shifting exponents, coefficients, and constants in directions that exert opposing forces on loss. For example, an intervention may improve the constant term while worsening the exponents. Strategies that appear optimal at small scale can reverse at larger scale, and compute-optimal token–parameter ratios can vary by orders of magnitude depending on the intervention. These findings demonstrate that data curation and scaling strategy are deeply intertwined, and that evaluating interventions only at fixed scales can lead to misleading conclusions. We recommend evaluating interventions through their full scaling trajectories using scaling-law projections.
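To make the recommended workflow concrete, the sketch below shows one way to fit a parametric scaling law per data intervention and compare interventions by projected loss rather than at a single scale. This is an illustrative assumption, not the authors' code: the Chinchilla-style form L(N, D) = E + A/N^α + B/D^β, the toy measurements, and the projection point (70B parameters, 1.4T tokens) are all hypothetical stand-ins for the paper's actual fitting procedure and data.

```python
# Minimal sketch (assumed, not from the paper): fit a Chinchilla-style scaling law
#     L(N, D) = E + A / N**alpha + B / D**beta
# to per-intervention (parameters, tokens, loss) measurements, then rank
# interventions by their projected loss along the full scaling trajectory.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, E, A, alpha, B, beta):
    """Predicted training loss for N parameters and D training tokens."""
    N, D = X
    return E + A / N**alpha + B / D**beta

rng = np.random.default_rng(0)

# Hypothetical measurements for one intervention; the loss values are synthesized
# here from the published Chinchilla fit (E=1.69, A=406.4, alpha=0.34,
# B=410.7, beta=0.28) plus small noise, purely to make the example runnable.
N = np.array([1e8, 1e8, 4e8, 1e9, 1e9, 8e9])
D = np.array([2e9, 2e10, 8e9, 2e10, 1e11, 2e11])
loss = scaling_law((N, D), 1.69, 406.4, 0.34, 410.7, 0.28) + rng.normal(0, 0.01, N.size)

# Fit the five scaling-law parameters for this dataset variant.
popt, _ = curve_fit(scaling_law, (N, D), loss,
                    p0=[1.7, 400.0, 0.3, 400.0, 0.3], maxfev=20000)
E, A, alpha, B, beta = popt
print(f"E={E:.2f}  A={A:.1f}  alpha={alpha:.3f}  B={B:.1f}  beta={beta:.3f}")

# Compare interventions at a budget beyond the fitted range instead of at one
# fixed (N, D) point, since small-scale rankings can reverse at larger scale.
projection = scaling_law((np.array([7e10]), np.array([1.4e12])), *popt)
print("projected loss at N=70B, D=1.4T:", projection)
```

Repeating this fit for each of the 23 dataset variants would yield one (E, A, α, B, β) tuple per intervention, making it visible when an intervention trades a better constant term for a worse exponent, which is the kind of conflicting movement the abstract describes.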
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20595