DataDecide: How to Predict Best Pretraining Data with Small Experiments

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: When do pretraining experiments at small scales correctly predict the ranking of different data recipes at larger scales as measured by downstream performance?
Abstract: Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide—the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (∼ 80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval > 80% predictable at the target 1B scale with just 0.01% of the compute.
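As a rough illustration of the decision setup described above (a minimal sketch, not the authors' code), the reported decision accuracy can be read as the fraction of data-recipe pairs whose ordering at a small scale matches their ordering at the 1B target scale. The recipe names and scores below are hypothetical.

```python
from itertools import combinations

def decision_accuracy(small_scores: dict, target_scores: dict) -> float:
    """Fraction of recipe pairs where the small-scale ranking agrees with
    the ranking at the target scale (higher score = better)."""
    recipes = sorted(set(small_scores) & set(target_scores))
    agree = total = 0
    for a, b in combinations(recipes, 2):
        small_pref = small_scores[a] - small_scores[b]
        target_pref = target_scores[a] - target_scores[b]
        if small_pref == 0 or target_pref == 0:
            continue  # skip ties
        total += 1
        if (small_pref > 0) == (target_pref > 0):
            agree += 1
    return agree / total if total else float("nan")

# Hypothetical benchmark scores for three data recipes at two model sizes.
small_150m = {"recipe_a": 0.41, "recipe_b": 0.47, "recipe_c": 0.44}
target_1b  = {"recipe_a": 0.52, "recipe_b": 0.61, "recipe_c": 0.58}
print(decision_accuracy(small_150m, target_1b))  # 1.0 here; ~0.8 on average in the paper
```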
Lay Summary: Because large language models are expensive to pretrain on different datasets of text, experiments with smaller-scale models are used to decide what data to use in the large final model. But how do we know whether small experiments lead us to the right decisions? To help answer this, we release models, data, and evaluations in the DataDecide suite, the most extensive open suite of models over differences in data and scale. We find that which models do best at a single, small size is a good predictor of which are best at our larger target scale. Fitting a line to multiple small experiments does no better. DataDecide can also be used to test future, not-yet-invented methods for making decisions from small experiments. For instance, we already find that using different metrics in the small experiments makes commonly used benchmarks much easier to predict with much smaller experiments.
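The "different metrics" mentioned above refer to the continuous likelihood proxies in the abstract; the exact definitions are in the paper. As one illustrative possibility (a minimal sketch with hypothetical function names and numbers), a continuous proxy can put probability mass on the gold answer instead of a 0/1 accuracy, which gives a smoother signal at small scale.

```python
import math

def discrete_accuracy(loglik_per_choice: list[float], gold_idx: int) -> float:
    """Standard metric: 1.0 if the gold answer has the highest log-likelihood."""
    best = max(range(len(loglik_per_choice)), key=loglik_per_choice.__getitem__)
    return float(best == gold_idx)

def correct_prob(loglik_per_choice: list[float], gold_idx: int) -> float:
    """Continuous proxy: softmax-normalized probability mass on the gold answer."""
    m = max(loglik_per_choice)
    unnorm = [math.exp(ll - m) for ll in loglik_per_choice]
    return unnorm[gold_idx] / sum(unnorm)

# Hypothetical per-choice log-likelihoods for two multiple-choice examples.
examples = [
    ([-12.3, -11.9, -13.0, -12.7], 1),  # model slightly prefers the gold answer
    ([-10.1, -10.4, -10.2, -10.3], 2),  # near-tie: accuracy is 0, correct_prob ~0.25
]
acc  = sum(discrete_accuracy(lls, g) for lls, g in examples) / len(examples)
prob = sum(correct_prob(lls, g) for lls, g in examples) / len(examples)
print(f"accuracy={acc:.2f}  correct_prob={prob:.2f}")
```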
Link To Code: https://github.com/allenai/DataDecide
Primary Area: Deep Learning->Large Language Models
Keywords: Pretraining Data, Language Models, Scaling Laws, Evaluation, Benchmarks, Data Recipes, Data Ablations
Submission Number: 14500