Keywords: training re-evaluation curve, data curriculum / data placement, large language model (LLM) pre-training, AdamW EMA timescale, learning-rate schedules, tokens-per-parameter ratio
TL;DR: We evaluate fully-trained LLMs on their original training data, measuring retention across steps; a predictive model of the resulting "re-evaluation curve" identifies optimal spots for high-quality data, surpassing default end-of-training placement.
Abstract: Data curricula have become central to successful LLM training, yet the principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is, by construction, observable only after training, we demonstrate that it can be *predicted in advance* from AdamW’s implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
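To make the two ideas in the abstract concrete, below is a minimal sketch, under our own assumptions, of how a TREC might be predicted in advance from AdamW's decoupled weight decay: with weight decay λ and per-step learning rates η_s, the update applied at step t survives to the end of training with coefficient roughly ∏_{s>t}(1 − η_s λ), i.e., the final weights act like an implicit EMA of updates. The function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def predicted_trec_weights(lrs, weight_decay):
    """Predict relative retention of the batch seen at each step.

    Assumption (our reading of the abstract): with AdamW's decoupled weight
    decay, the update from step t is multiplied by (1 - lr_s * wd) at every
    later step s, so its survival coefficient is the product of those factors.
    Higher survival should correspond to better retention, i.e., TREC minima
    (low re-evaluation loss) where the paper proposes placing high-quality data.
    """
    lrs = np.asarray(lrs, dtype=float)
    decay = 1.0 - lrs * weight_decay            # per-step shrink factor on existing weights
    tail = np.cumprod(decay[::-1])[::-1]        # tail[t] = prod_{s >= t} decay[s]
    survival = np.concatenate([tail[1:], [1.0]])  # update at step t only decays at steps s > t
    return survival / survival.max()

# Example: a cosine learning-rate schedule over 10k steps, peak lr 3e-4, weight decay 0.1.
steps = 10_000
lrs = 0.5 * 3e-4 * (1 + np.cos(np.pi * np.arange(steps) / steps))
w = predicted_trec_weights(lrs, weight_decay=0.1)
# Steps with high w are predicted to be well retained by the final model;
# under the paper's proposal, these are candidate positions for high-quality data.
```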
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13099