Keywords: Data repetition, repeated data, language model pretraining, deduplication, memorization, scaling laws, compute efficiency, compute-equivalent loss, Chinchilla scaling, generalization, misspecified linear regression
TL;DR: Repetition predictably wastes compute, with the worst harm at intermediate repeat counts.
Abstract: Language models are running out of high-quality training data, and even aggressively deduplicated corpora retain some repetition. Earlier controlled studies predated Chinchilla-style scaling laws and could only measure the cost of repetition indirectly. We revisit repetition in the Chinchilla-style scaling regime, using a fitted no-repetition scaling law to report Compute-Equivalent Gain and Compute-Equivalent Loss. We show that repetition damage in this modernized regime is systematic in three ways. First, eval loss is worst at an intermediate repeat count $R$, so repeating a moderately sized subset a moderate number of times hurts more than repeating either a large subset a few times or a small subset many times. Second, the location of this peak is well fit by a power law in model size. Finally, when repeated documents make up 10\% of training tokens in a controlled exact-document repetition setting, the compute-equivalent loss can be large: on FineWeb-Edu-Dedup, the most damaging repeat count for a Qwen3-style 344M-parameter model at $OT=1$ matches the loss of a no-repetition run using about 67\% of the FLOPs, under our fitted no-repetition scaling law. A misspecified linear regression with verbatim duplicates reproduces the same qualitative non-monotonicity in closed form, suggesting that such peaks can arise from a statistical tradeoff between memorization and generalization. Our findings give practitioners a way to predict which settings waste the most compute before they spend any of it.
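To make the "matches the loss of a no-repetition run using about 67\% of the FLOPs" reading concrete, here is a minimal Python sketch of how a compute-equivalent fraction could be read off a no-repetition scaling law. It is an illustration under stated assumptions, not the paper's procedure: it assumes the Chinchilla parametric form L(N, D) = E + A/N^alpha + B/D^beta, uses the coefficients published by Hoffmann et al. (2022) purely as placeholders for the paper's own fit, holds model size fixed so that a FLOP ratio reduces to a token ratio under C ≈ 6ND, and uses a hypothetical token budget. The names `ScalingLaw`, `equivalent_tokens`, and `compute_equivalent_fraction` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingLaw:
    """Chinchilla-form law L(N, D) = E + A / N**alpha + B / D**beta.

    Defaults are the Hoffmann et al. (2022) coefficients, used here only as
    placeholders; the paper's fitted values are not given in the abstract.
    """
    E: float = 1.69
    A: float = 406.4
    B: float = 410.7
    alpha: float = 0.34
    beta: float = 0.28

    def loss(self, n_params: float, n_tokens: float) -> float:
        """Predicted no-repetition loss for N parameters and D unique tokens."""
        return self.E + self.A / n_params**self.alpha + self.B / n_tokens**self.beta

    def equivalent_tokens(self, observed_loss: float, n_params: float) -> float:
        """Invert the law in D: tokens a no-repetition run needs to hit this loss."""
        gap = observed_loss - self.E - self.A / n_params**self.alpha
        if gap <= 0:
            raise ValueError("observed loss is below what this model size can reach")
        return (self.B / gap) ** (1.0 / self.beta)


def compute_equivalent_fraction(law: ScalingLaw, observed_loss: float,
                                n_params: float, n_tokens: float) -> float:
    """Fraction of the run's FLOPs a no-repetition run would need for the same loss.

    Assumes fixed model size and C ~ 6 * N * D, so the FLOP ratio is a token ratio.
    """
    return law.equivalent_tokens(observed_loss, n_params) / n_tokens


law = ScalingLaw()
n_params = 344e6            # model size quoted in the abstract
n_tokens = 20 * n_params    # hypothetical token budget, not taken from the paper

# Simulate a repeated-data run whose loss equals that of a clean run on 67% of the tokens.
observed = law.loss(n_params, 0.67 * n_tokens)
frac = compute_equivalent_fraction(law, observed, n_params, n_tokens)
print(f"compute-equivalent fraction: {frac:.2f}")
```

Because the observed loss here is generated from the law itself, the round trip recovers the 0.67 fraction by construction; with real measurements, `observed` would be the eval loss of the repeated-data run and the law would be the one fitted to no-repetition runs.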
Submission Number: 200