Internal Data Repetition Destroys Language Models

Published: 25 May 2026, Last Modified: 25 May 2026, Venue: CTB@ICML 2026, Readers: Everyone, License: CC BY 4.0
Keywords: Repeated Data, Generalization, Scaling Laws, Compute Efficiency, Memorization
TL;DR: We study how repeated pretraining data affects foundation-model performance and generalization, combining scaling experiments with a simple statistical model of repetition damage.
Abstract: Language models are running out of high-quality training data, and even aggressively deduplicated corpora retain some repetition. Earlier controlled studies predated Chinchilla-style scaling laws and could only measure the cost of repetition indirectly. We revisit repetition in the Chinchilla-style scaling regime, using a fitted no-repetition scaling law to report Compute-Equivalent Gain and Compute-Equivalent Loss. We show that repetition damage in this regime is systematic in three ways. First, eval loss is worst at an intermediate repeat count $R$: repeating a moderately sized subset a moderate number of times hurts more than repeating either a large subset a few times or a small subset many times. Second, the location of this peak is well fit by a power law in model size. Third, when repeated documents make up 10% of training tokens in a controlled exact-document repetition setting, the compute-equivalent loss can be large: on FineWeb-Edu-Dedup, the most damaging repeat count for a Qwen3-style 344M-parameter model at $OT=1$ matches the loss of a no-repetition run using about 67% of the FLOPs, under our fitted no-repetition scaling law. A misspecified linear regression with verbatim duplicates reproduces the same qualitative non-monotonicity in closed form, suggesting that such peaks can arise from a statistical tradeoff between memorization and generalization. Our findings give practitioners a way to predict which settings waste the most compute before they spend any of it.
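The compute-equivalent comparison in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, a single-variable no-repetition law of the form L(C) = E + A * C^(-alpha); the paper's actual fitted law is not given here and may differ. The function names and the constants `E`, `A`, `ALPHA`, and `BUDGET` are hypothetical placeholders, not values from the paper.

```python
# Illustrative only: assumed law L(C) = E + A * C**(-ALPHA), not the paper's fit.

def loss_from_compute(compute, e, a, alpha):
    """Predicted no-repetition eval loss at a given FLOP budget."""
    return e + a * compute ** (-alpha)


def compute_from_loss(loss, e, a, alpha):
    """Invert the assumed law: FLOPs a no-repetition run needs to reach `loss`."""
    if loss <= e:
        raise ValueError("loss must exceed the irreducible term e")
    return ((loss - e) / a) ** (-1.0 / alpha)


def compute_equivalent_fraction(observed_loss, budget, e, a, alpha):
    """Fraction of `budget` at which a no-repetition run matches `observed_loss`.

    A value below 1 means the run with repetition wasted compute; a value
    above 1 would indicate a compute-equivalent gain.
    """
    return compute_from_loss(observed_loss, e, a, alpha) / budget


# Hypothetical constants chosen only to make the example concrete.
E, A, ALPHA = 1.7, 50.0, 0.1
BUDGET = 1e19  # FLOPs spent by the run that contains repeated data

# Suppose the repeated-data run ends at the loss a clean run reaches with 67% of BUDGET.
observed = loss_from_compute(0.67 * BUDGET, E, A, ALPHA)
fraction = compute_equivalent_fraction(observed, BUDGET, E, A, ALPHA)
```

Under these assumptions `fraction` recovers the 0.67 used to construct `observed`, which is the sense in which a run with repetition "matches the loss of a no-repetition run using about 67% of the FLOPs".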
Paper Type: Long (8 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 148