Keywords: Scaling Laws; Omni-modal Models; Data Mixture
TL;DR: ModalMix optimizes the data mixture for large multimodal model training (modeling cross-modal interactions and compute dependence) to find optimal data ratios, enabling 1.4× faster convergence and a 47% better average rank across 17 tasks.
Abstract: Training large multimodal models requires optimizing the data mixture to balance cross-modal synergies against finite computational resources.
However, existing heuristics for data mixing largely ignore the underlying cross-modal dynamics and their dependence on compute scaling.
In this work, we propose ModalMix, a framework that formalizes data mixture optimization by simultaneously modeling cross-modal interactions and compute-dependent scaling laws. ModalMix yields a predictive regressor for the optimal data mixture at any given computational budget.
Empirically, models trained with ModalMix achieve 1.4× faster convergence than models trained on a uniform data mixture, along with a 47% better average rank across 17 downstream tasks.
The framework reveals that the optimal strategy is dynamic, not static: it initially prioritizes speech data, then gradually shifts towards image-text data as compute increases, while keeping the share of text data stable.
ModalMix offers a flexible and principled solution to the data-mixing problem, bridging a critical gap between scaling theory and practical multimodal pretraining.
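The abstract does not specify ModalMix's functional form, so the following is only a minimal, hypothetical sketch of the general recipe it describes: fit a compute-dependent loss surface with cross-modal interaction terms on a handful of pilot runs, then solve for the mixture that minimizes predicted loss at a target compute budget. All function forms, names (`predicted_loss`, `fit_surface`, `optimal_mixture`), modality sets, and numbers below are illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a compute-dependent mixture-optimization loop.
# The parametric form, coefficients, and data are assumptions for illustration.
import numpy as np
from scipy.optimize import minimize

MODALITIES = ["text", "image_text", "speech"]  # illustrative modality set

def predicted_loss(params, w, compute):
    """Toy loss surface: per-modality power-law terms plus pairwise interaction
    (synergy) terms, each modulated by compute. Purely an assumed form."""
    k = len(MODALITIES)
    a = params[:k]                        # per-modality coefficients
    b = params[k:2 * k]                   # compute exponents
    inter = params[2 * k:].reshape(k, k)  # cross-modal interaction matrix
    base = np.sum(a * np.power(w + 1e-8, -0.5) * compute ** (-np.abs(b)))
    synergy = -np.sum(inter * np.outer(w, w)) * compute ** (-0.1)
    return base + synergy

def fit_surface(runs):
    """Fit the toy surface to observed (mixture, compute, loss) triples."""
    k = len(MODALITIES)
    def objective(params):
        return sum((predicted_loss(params, w, c) - loss) ** 2 for w, c, loss in runs)
    init = np.full(2 * k + k * k, 0.1)
    return minimize(objective, init, method="L-BFGS-B").x

def optimal_mixture(params, compute):
    """Minimize predicted loss over the probability simplex at a fixed budget."""
    k = len(MODALITIES)
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(1e-3, 1.0)] * k
    res = minimize(lambda w: predicted_loss(params, np.asarray(w), compute),
                   x0=np.full(k, 1.0 / k), bounds=bounds, constraints=cons)
    return dict(zip(MODALITIES, res.x))

# Synthetic pilot runs: (mixture weights, compute in FLOPs, eval loss).
runs = [
    (np.array([0.4, 0.3, 0.3]), 1e19, 2.10),
    (np.array([0.2, 0.5, 0.3]), 1e19, 2.05),
    (np.array([0.3, 0.3, 0.4]), 1e20, 1.80),
    (np.array([0.5, 0.2, 0.3]), 1e20, 1.85),
]
params = fit_surface(runs)
print(optimal_mixture(params, compute=1e21))  # predicted mixture at a larger budget
```

Under this kind of setup, re-solving `optimal_mixture` at increasing compute budgets is what would produce a compute-dependent (dynamic) mixture schedule of the sort the abstract describes.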
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10417