DataRater: Meta-Learned Dataset Curation

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 | NeurIPS 2025 poster | CC BY 4.0
Keywords: meta-learning, dataset curation, data rating, bilevel optimization
TL;DR: We introduce the "DataRater", a meta-learning approach to automatically learn the value of data and use it to improve the compute efficiency of training foundation models.
Abstract: The quality of foundation models depends heavily on their training data. Consequently, great effort has been put into dataset curation. Yet most approaches rely on manual tuning of coarse-grained mixtures of large buckets of data, or on filtering by hand-crafted heuristics. An approach that is ultimately more scalable (not to mention more satisfying) is to \emph{learn} which data is actually valuable for training. This type of meta-learning could allow more sophisticated, fine-grained, and effective curation. Our proposed \emph{DataRater} is an instance of this idea. It estimates the value of training on any particular data point. This is done by meta-learning using 'meta-gradients', with the objective of improving training efficiency on held-out data. In extensive experiments across a range of model scales and datasets, we find that using our DataRater to filter data is highly effective, resulting in significantly improved compute efficiency.
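The bilevel structure described in the abstract can be illustrated with a minimal meta-gradient sketch in JAX. Everything below (the linear model, the softmax-normalised rater, names like `rater_score` and `inner_update`) is a hypothetical toy under our own assumptions, not the paper's architecture or implementation; it only demonstrates the mechanism: differentiating a held-out loss through one weighted inner training step yields a gradient on the rater's parameters.

```python
import jax
import jax.numpy as jnp

def model_loss(w, x, y):
    # Tiny linear model standing in for a foundation model; returns
    # a per-example loss so the rater can weight individual data points.
    pred = x @ w
    return jnp.mean((pred - y) ** 2, axis=-1)

def rater_score(phi, x):
    # Toy "DataRater": maps each data point to a scalar weight,
    # softmax-normalised over the batch so weights average to 1.
    # This parameterisation is an assumption, not the paper's.
    logits = x @ phi
    return jax.nn.softmax(logits) * logits.shape[0]

def inner_update(w, phi, x, y, lr=0.1):
    # One inner training step: per-example losses scaled by rater scores.
    def weighted_loss(w):
        return jnp.mean(rater_score(phi, x) * model_loss(w, x, y))
    return w - lr * jax.grad(weighted_loss)(w)

def meta_loss(phi, w, x_tr, y_tr, x_va, y_va):
    # Outer objective: held-out loss after the weighted inner step.
    # jax.grad differentiates *through* inner_update, giving the
    # meta-gradient with respect to the rater parameters phi.
    w_new = inner_update(w, phi, x_tr, y_tr)
    return jnp.mean(model_loss(w_new, x_va, y_va))

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
d = 8
w = jax.random.normal(k1, (d, 1)) * 0.1
phi = jnp.zeros((d,))
x_tr, y_tr = jax.random.normal(k2, (32, d)), jax.random.normal(k3, (32, 1))
x_va, y_va = jax.random.normal(k4, (16, d)), jnp.zeros((16, 1))

meta_grad = jax.grad(meta_loss)(phi, w, x_tr, y_tr, x_va, y_va)
phi = phi - 0.01 * meta_grad  # one outer step on the rater
```

In practice one would alternate many such outer steps with ongoing inner training, and the learned scores could then be thresholded to filter data, as the abstract describes; the softmax normalisation here is just one simple way to keep the average weight fixed so the rater reallocates rather than rescales the training signal.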
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 9384