Train Smarter, Not Longer: Memorization-Guided Data Reuse for Efficient LLM Training

Published: 02 Mar 2026, Last Modified: 02 Mar 2026 | Venue: ICLR 2026 Workshop DATA-FM | Readers: Everyone | License: CC BY 4.0
Keywords: Large Language Models, Memorization, LLM Data, Data Strategy, Training Strategy
TL;DR: We introduce the Memorization Window framework, characterizing when LLMs forget training data and when overfitting begins. It enables principled multi-epoch data reuse that improves performance far beyond current practice.
Abstract: The training paradigm of large language models has shifted from traditional one-pass training to multi-epoch training, as reasonable reuse of limited high-quality data can improve both model performance and sample efficiency. However, excessive repetition introduces the risk of overfitting and diminishing returns. Determining when and how to reuse data effectively thus emerges as a natural but under-explored question. Based on a novel observation of a model's $\textit{Memorization Window}$ signals, derived from loss retention dynamics and downstream evaluation scores, we propose $\textit{Memorization-guided Data Reuse}$, a training paradigm that adaptively determines $\textit{when}$ and $\textit{how}$ data should be reused, enabling principled decisions on the number of training epochs and the scheduling of data replays. Our preliminary experiments reveal a consistent memorization-driven regime: performance continues to improve with repetition far beyond current practice (e.g., the commonly cited four-epoch limit). While a full scheduler remains future work, these insights provide a foundation for memorization-aware training schedules, helping to determine reuse budgets and move toward training LLMs $\textit{smarter rather than longer}$ with limited high-quality data.
Submission Number: 117