Less Data, Faster Training: sampling bias from small datasets can speed up training

Published: 02 Mar 2026, Last Modified: 15 May 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: data repetition, parity, single-index model, training efficiency
TL;DR: Sampling biases from small datasets effectively adjust the relative growth of layer norms and can hence accelerate training in terms of compute.
Abstract: This work investigates the "small-vs-large gap", where repeated training on _fewer samples_ can lead to _compute savings_ compared to training on a larger dataset. The gap is observed across algorithmic tasks, architectures, and optimizers, and cannot be explained by prior theory. We argue that the speedup comes from appropriate layer-wise norm growth enabled by _sampling biases_, which are more pronounced when the dataset is smaller. We provide both theoretical analysis and empirical evidence from various interventions. Our results show that using a smaller dataset with more repetitions is not just a fallback strategy under data scarcity, but can be proactively leveraged as a favorable inductive bias for optimization, particularly in reasoning tasks.
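To make the "small-vs-large gap" setup concrete, below is a minimal illustrative sketch (not the authors' code) comparing training on a small, repeatedly sampled dataset against fresh samples on a sparse-parity task, one of the algorithmic tasks suggested by the keywords. All names, hyperparameters, and the specific architecture are illustrative assumptions; the sketch only shows how one might measure per-layer weight norms and compute-to-accuracy under the two regimes.

```python
# Hypothetical sketch: small repeated dataset vs. fresh samples on k-sparse parity.
# Tracks per-layer weight norms and test error; all settings are assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k = 30, 3                        # input dimension and parity degree (assumed)
parity_idx = torch.arange(k)        # parity over the first k coordinates

def sample_batch(n):
    """x ~ Unif{-1,+1}^d, label = product of the k parity coordinates."""
    x = torch.randint(0, 2, (n, d)).float() * 2 - 1
    y = x[:, parity_idx].prod(dim=1)
    return x, y

def make_model():
    return nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 1))

def train(small_dataset_size=None, steps=2000, batch=256, lr=0.05):
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    if small_dataset_size is not None:          # fixed small dataset, repeated
        xs, ys = sample_batch(small_dataset_size)
    for _ in range(steps):
        if small_dataset_size is not None:
            idx = torch.randint(0, small_dataset_size, (batch,))
            x, y = xs[idx], ys[idx]
        else:                                   # fresh (online) samples each step
            x, y = sample_batch(batch)
        loss = ((model(x).squeeze(-1) - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    # report layer norms (to compare relative growth) and test error
    w1 = model[0].weight.norm().item()
    w2 = model[2].weight.norm().item()
    xt, yt = sample_batch(4096)
    err = (model(xt).squeeze(-1).sign() != yt).float().mean().item()
    return w1, w2, err

for size in (512, None):            # 512 repeated samples vs. fresh samples
    w1, w2, err = train(small_dataset_size=size)
    tag = f"repeat n={size}" if size else "fresh samples"
    print(f"{tag:>15}: |W1|={w1:.2f} |W2|={w2:.2f} test err={err:.3f}")
```

Under the paper's claim, the repeated-small-dataset run would reach low test error in fewer steps (less compute) due to different relative growth of |W1| and |W2|; this sketch only sets up the comparison and makes no guarantee about the outcome.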
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 49