Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Published: 26 Jan 2026, Last Modified: 11 Feb 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: proxy models, data curation
TL;DR: We propose using very small learning rates for proxy models to better preserve the relative performance rankings that would be obtained with optimally-tuned large-scale training.
Abstract: Data teams at frontier AI companies routinely train small proxy models to make critical decisions about pretraining data recipes for full-scale training. However, the community has a limited understanding of whether and when conclusions drawn from small-scale experiments reliably transfer to large-scale production training. In this work, we uncover a critical issue in the standard practice of training small proxy models on each data recipe with a single set of hyperparameters. We demonstrate that each dataset requires its own optimal training configuration, and that dataset rankings can completely reverse with even minor adjustments to proxy training hyperparameters. Furthermore, this creates a disconnect from the actual model development pipeline, where hyperparameter optimization is a standard step. Consequently, we propose that the objective of data selection should be to identify the dataset that yields the best performance after its own hyperparameter optimization. We introduce a simple yet effective patch to the current proxy-model-based method: training proxy models with sufficiently small learning rates produces dataset rankings that strongly correlate with those obtained when large-scale models are properly tuned for each dataset. Theoretically, we prove that, for random-feature models, this approach preserves the ordering of datasets according to their optimal achievable losses. Empirically, we validate this approach through comprehensive experiments across 23 data recipes covering four critical dimensions of data curation decisions faced in production settings, demonstrating dramatic improvements in proxy model reliability.
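The core idea — that sufficiently small proxy learning rates yield dataset rankings which track each recipe's optimal achievable loss — can be illustrated with a toy sketch. This is not the paper's implementation; the recipe construction, function names, and hyperparameters below are illustrative assumptions, using plain gradient descent on synthetic linear-regression "data recipes" whose achievable losses differ by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_recipe(noise):
    """Synthetic 'data recipe': linear targets with recipe-specific label noise,
    so the recipes differ in their optimal achievable loss (illustrative only)."""
    X = rng.normal(size=(200, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + noise * rng.normal(size=200)
    return X, y

def proxy_loss(X, y, lr, steps=500):
    """Train a small linear proxy model with plain gradient descent
    and report its final mean-squared error on the recipe."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

recipes = {"clean": make_recipe(0.1), "noisy": make_recipe(1.0)}

# With a small learning rate, each proxy's final loss reflects the recipe's
# achievable loss, so the induced ranking matches the optimal one.
losses = {name: proxy_loss(X, y, lr=1e-2) for name, (X, y) in recipes.items()}
ranking = sorted(losses, key=losses.get)
print(ranking)
```

In this sketch the lower-noise recipe ranks first, mirroring the paper's claim for random-feature models that small-learning-rate proxies preserve the ordering of datasets by their optimal achievable losses; real proxy runs would of course involve full pretraining pipelines rather than linear regression.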
Primary Area: interpretability and explainable AI
Submission Number: 22841