Keywords: Visual Instruction Tuning, Data Selection, Visual-language Models, Survey
Abstract: The progress of visual-language models has made visual instruction tuning central to multimodal alignment, yet its effectiveness depends heavily on the composition of training data. Visual instruction datasets are often heterogeneous and redundant, necessitating principled data selection to ensure downstream performance. Despite growing interest, prior studies remain fragmented, relying on disparate evaluation criteria and inconsistent terminology that obscure the underlying design space. To address this, we present the first comprehensive survey of visual instruction data selection, providing a unified perspective on both evaluation and selection mechanisms. We introduce a factor-based analytical framework that organizes core data properties into a structured Data Evaluation Factor Library. Furthermore, we categorize existing methods into feature-based, prediction-based, gradient-based, and hybrid paradigms, analyzing how they operationalize scoring and filtering. By bridging evaluation factors with selection mechanisms, this survey consolidates fragmented insights and outlines future directions for data-efficient multimodal learning.
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Surveys
Languages Studied: English
Submission Number: 733