Keywords: Data mixture, Vision-language model
TL;DR: We introduce a systematic data-mixing framework for vision-language models that derives multi-modal alignment scores and can additionally handle missing modalities.
Abstract: Vision-language models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of the data mixture. This paper introduces MMix, a principled framework for automatically determining multi-modal data mixtures for VLM training. We formulate this task as modality-aware alignment maximization over domains, deriving multi-modal alignment scores from the dual solution through inter-modal coupling variables. Crucially, our method is designed to handle domains with missing modalities, allowing the systematic integration of language-only domains. In experiments on both 0.5B and 7B VLMs, MMix boosts accuracy on diverse evaluation benchmarks at marginal computational cost. Remarkably, it matches expert-tuned performance 1.28$\times$ faster in image-text tuning and extends to more complex multi-modal video scenarios, outperforming uniform mixture weights with only 33\% of the training steps.
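The abstract describes the method only at a high level; the toy sketch below is purely illustrative and is not the paper's actual formulation. It assumes per-domain alignment scores are already available and maps them to sampling weights on the probability simplex via a softmax with a hypothetical temperature knob; the function name and the data are placeholders.

```python
import numpy as np

def mixture_weights_from_scores(alignment_scores, temperature=1.0):
    """Toy illustration: convert per-domain alignment scores into
    sampling weights that sum to 1. The softmax mapping and the
    temperature parameter are assumptions for illustration only."""
    s = np.asarray(alignment_scores, dtype=float) / temperature
    s -= s.max()               # subtract max for numerical stability
    w = np.exp(s)
    return w / w.sum()         # normalize onto the probability simplex

# Hypothetical scores for three image-text domains and one language-only domain
scores = [0.82, 0.47, 0.65, 0.30]
print(mixture_weights_from_scores(scores))
```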
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21074