MixAtlas: Uncertainty-aware Data Mixture for Multimodal LLM Midtraining

Published: 02 Mar 2026, Last Modified: 02 Apr 2026 · ICLR 2026 Workshop DATA-FM · CC BY 4.0
Keywords: Multimodal, midtraining, data mixture, uncertainty, interpretability
Abstract: Domain reweighting can improve sample efficiency and downstream generalization; however, data-mixture optimization for multimodal midtraining remains underexplored. Current multimodal training recipes tune mixtures from only a single perspective, such as data format or task type. We introduce MixAtlas, which produces a benchmark-targeted data recipe that users can inspect, adapt, and transfer to their own corpora and downstream goals. MixAtlas curates the training image corpus along two interpretable axes---\emph{image concepts} and \emph{task supervision}---enabling interpretable mixture control and fine-grained attribution of downstream performance to specific domains within each axis. Using small proxy models and a Gaussian-process surrogate, we show that the optimal mixtures successfully transfer to larger-scale models. The resulting mixtures yield substantial improvements: up to 2$\times$ faster convergence and consistent average gains of 1\%--17.6\% across 10 diverse benchmarks compared with the strongest data-mixture baselines. Overall, MixAtlas makes multimodal mixture optimization interpretable and adaptable, providing concrete, compute-efficient data recipes for training next-generation MLLMs.
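The abstract's core loop (evaluate mixtures on small proxy models, fit a Gaussian-process surrogate, and use its uncertainty to pick the next mixture) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three-domain setup, the `proxy_score` stand-in for a proxy-model benchmark run, and the UCB exploration weight are all hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

def sample_mixtures(n, k=3):
    # Each row is a candidate data mixture: k domain weights summing to 1.
    return rng.dirichlet(np.ones(k), size=n)

def proxy_score(w):
    # Hypothetical stand-in for training a small proxy model on mixture w
    # and measuring a downstream benchmark score (unknown to the optimizer).
    target = np.array([0.5, 0.3, 0.2])
    return 1.0 - np.linalg.norm(w - target, axis=-1)

# Fit a GP surrogate on a handful of already-evaluated proxy runs.
X = sample_mixtures(12)
y = proxy_score(X)
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

# Uncertainty-aware selection: upper confidence bound over fresh candidates,
# trading off predicted score (mu) against surrogate uncertainty (sigma).
cand = sample_mixtures(2000)
mu, sigma = gp.predict(cand, return_std=True)
best = cand[np.argmax(mu + 1.0 * sigma)]
print("proposed mixture:", best)
```

In practice the proposed mixture would be evaluated with another proxy-model run, appended to `(X, y)`, and the surrogate refit, iterating until the budget is exhausted before transferring the best mixture to the large-scale model.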
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 156