MixMin: Finding Data Mixtures via Convex Minimization

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We show that the bi-level optimization for data mixing reduces to a convex minimization as model classes become larger.
Abstract: Modern machine learning pipelines increasingly combine and mix data from diverse and disparate sources, e.g., for pre-training large language models. Yet, finding the optimal data mixture is a challenging and open problem. We formalize this data mixing problem as a bi-level objective: the best mixture is the one that would lead to the best model for a downstream objective. Unfortunately, this objective is generally intractable. In this paper, we make the observation that the bi-level data mixing objective becomes convex as our model class becomes larger. We develop and study a gradient-based approach for optimizing this convex objective, which we call MixMin, and test it on language modeling and chemistry tasks. MixMin was the only method that uniformly improved the data mixture in all our experiments. With MixMin, we improved the data mixture using less than 0.2% additional compute for a pythia-$410M$ model trained on $8.2B$ tokens, yielding a 1-5% relative improvement in negative log likelihood on PIQA, ARC Easy, SciQ, and OpenWebMath. Crucially, we found that MixMin mixtures found with smaller models improved the training of larger models, suggesting that MixMin mixtures may be scale-invariant. When mixing bioassay data to train an XGBoost model, we saw improvements of $0.03-0.15$ in average precision scores.
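The abstract describes the reduction only at a high level. Below is a minimal, hedged sketch of one way such a convex surrogate could be optimized: it assumes the surrogate is the mean negative log-likelihood, on downstream examples, of a weighted mixture of cheap per-source proxy models, and uses a plain exponentiated-gradient step on the simplex. All function and variable names are illustrative and not taken from the paper's code.

```python
# Illustrative sketch (not the authors' implementation): given per-source proxy
# models' likelihoods on a downstream set, find mixture weights minimizing the
# negative log-likelihood of the weighted mixture, which is convex in the weights.
import numpy as np

def mixmin_weights(source_probs, n_steps=500, lr=0.5):
    """source_probs: array of shape (num_sources, num_downstream_examples),
    where source_probs[k, i] is the probability proxy model k assigns to
    downstream example i. Returns mixture weights on the probability simplex."""
    k, n = source_probs.shape
    w = np.full(k, 1.0 / k)                        # start from the uniform mixture
    for _ in range(n_steps):
        mix = w @ source_probs                     # mixture likelihood per example
        grad = -(source_probs / mix).mean(axis=1)  # gradient of mean NLL w.r.t. w
        w = w * np.exp(-lr * grad)                 # exponentiated-gradient step
        w /= w.sum()                               # re-normalize onto the simplex
    return w

# Toy usage: three sources, ten downstream examples with random likelihoods.
rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=(3, 10))
print(mixmin_weights(probs))
```

The returned weights would then set the sampling proportions over the data sources when training the full-scale model; the exponentiated-gradient update is just one convenient choice for constrained convex minimization over the simplex.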
Lay Summary: Performant machine learning requires a dataset relevant to the task you want to learn. When many sources of data are available, deciding how to combine them into a good dataset is typically a hard optimization problem. In this paper we showed that this optimization can be simplified if we first train a (cheap) model on each of our data sources. With this, we provided a method for constructing better datasets, leading to improvements on language modeling and chemistry tasks. Our work paves the way for finding useful datasets for typically data-scarce tasks.
Primary Area: Deep Learning->Algorithms
Keywords: Data Mixing, Bi-Level optimization, Convex Surrogates, Large Language Models, Foundation Models, Chemistry
Submission Number: 8015