Concept or Skills? Rethinking Instruction Selection for Multi-modal Models

Andrew Bai; Justin Cui; Ruochen Wang; Cho-Jui Hsieh

15 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: instruction selection, multimodality, vision-language model, data selection, instruction-tuning, SFT

Abstract: Most existing instruction selection methods in vision-language learning rely on sample embeddings to guide data choice. These embeddings are typically derived from pure vision encoders or small multimodal models, and they primarily capture visual concepts while under-representing visual skills such as counting, spatial reasoning, or commonsense inference. This imbalance overlooks a key distinction: multimodal benchmarks vary widely in whether they emphasize conceptual grounding or skill-based reasoning. We show that this concept-skill axis provides a systematic lens for characterizing benchmark demands, and that prioritizing one dimension often comes at the expense of the other. To address this, we introduce a simple benchmark-aware data selection framework that adapts training data to the dominant alignment factor of each benchmark. Across twelve diverse benchmarks, our approach yields consistent improvements, especially in low-data regimes (+0.9% over the best existing baseline on average and +1.2% on the skill-focused subset). More broadly, our findings highlight that advancing multimodal learning requires explicit recognition of the dual role of concepts and skills in shaping benchmark behavior.

Primary Area: foundation or frontier models, including LLMs

Submission Number: 6402
