Keywords: instruction selection, multimodality, vision-language model, data selection, instruction-tuning, SFT
Abstract: Most existing instruction selection methods in vision-language learning rely on sample embeddings to guide data choice. These embeddings are typically derived from pure vision encoders or small multimodal models, and they primarily capture \emph{visual concepts} while under-representing \emph{visual skills} such as counting, spatial reasoning, or commonsense inference. This imbalance overlooks a key distinction: multimodal benchmarks vary widely in whether they emphasize conceptual grounding or skill-based reasoning. We show that this concept--skill axis provides a systematic lens for characterizing benchmark demands, and that prioritizing one dimension often comes at the expense of the other. To address this trade-off, we introduce a simple benchmark-aware data selection framework that adapts the selected training data to the dominant alignment factor of each benchmark. Across twelve diverse benchmarks, our approach yields consistent improvements, especially in low-data regimes (+0.9\% over the best existing baseline on average and +1.2\% on the skill-focused subset). More broadly, our findings suggest that advancing multimodal learning requires explicit recognition of the dual role of concepts and skills in shaping benchmark behavior.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6402