Keywords: Data mixture, Vision-language model
TL;DR: We introduce a systematic data-mixing framework for vision-language models that derives multi-modal alignment scores and can additionally handle missing modalities.
Abstract: Vision-language models (VLMs) are typically trained on a diverse set of multi-modal domains, yet current practice relies on costly manual tuning of the data mixture. This paper introduces MMix, a principled framework for automatically determining multi-modal data mixtures for VLM training. We formulate this task as modality-aware alignment maximization over domains, deriving multi-modal alignment scores from the dual solution through inter-modal coupling variables. Crucially, our method is designed to handle domains with missing modalities, allowing the systematic integration of language-only domains. In experiments on both 0.5B and 7B VLMs, MMix boosts accuracy on diverse evaluation benchmarks at marginal computational cost. Remarkably, it matches expert-tuned performance 1.28$\times$ faster in image-text tuning and extends to more complex multi-modal video scenarios, outperforming uniform mixture weights with only 33\% of the training steps.
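The abstract describes the method only at a high level; the toy sketch below is purely illustrative and is not the paper's actual formulation. It assumes per-domain alignment scores are already available and maps them to sampling weights on the probability simplex via a softmax with a hypothetical temperature knob; the function name and the data are placeholders.

```python
import numpy as np

def mixture_weights_from_scores(alignment_scores, temperature=1.0):
    """Toy illustration: convert per-domain alignment scores into
    sampling weights that sum to 1. The softmax mapping and the
    temperature parameter are assumptions for illustration only."""
    s = np.asarray(alignment_scores, dtype=float) / temperature
    s -= s.max()               # subtract max for numerical stability
    w = np.exp(s)
    return w / w.sum()         # normalize onto the probability simplex

# Hypothetical scores for three image-text domains and one language-only domain
scores = [0.82, 0.47, 0.65, 0.30]
print(mixture_weights_from_scores(scores))
```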
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21074