From Factors to Methods: A Comprehensive Survey on Visual Instruction Tuning Data Selection

ACL ARR 2026 January Submission 733 Authors

24 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · License: CC BY 4.0
Keywords: Visual Instruction Tuning, Data Selection, Visual-language Models, Survey
Abstract: The progress of vision–language models has made visual instruction tuning central to multimodal alignment, yet its effectiveness depends heavily on the composition of training data. Visual instruction datasets are often heterogeneous and redundant, necessitating principled data selection to ensure downstream performance. Despite growing interest, prior studies remain fragmented, relying on disparate evaluation criteria and inconsistent terminology that obscure the underlying design space. To address this, we present the first comprehensive survey of visual instruction data selection, providing a unified perspective on both evaluation and selection mechanisms. We introduce a factor-based analytical framework that organizes core data properties into a structured Data Evaluation Factor Library. Furthermore, we categorize existing methods into feature-based, prediction-based, gradient-based, and hybrid paradigms, analyzing how each operationalizes scoring and filtering. By bridging evaluation factors with selection mechanisms, this survey consolidates fragmented insights and outlines future directions for data-efficient multimodal learning.
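The "scoring and filtering" pattern the abstract describes can be illustrated with a minimal toy sketch (not taken from the paper): every candidate sample receives a scalar quality score, and only the top-scoring fraction is kept for tuning. The scoring function here is a hypothetical feature-based stand-in; real methods in the survey's taxonomy would substitute feature-, prediction-, or gradient-based scorers.

```python
# Toy sketch of score-then-filter data selection. All names and the
# scoring heuristic are hypothetical illustrations, not the paper's method.

def select_top_fraction(samples, score_fn, keep_ratio=0.2):
    """Score every sample and keep the highest-scoring fraction."""
    scored = sorted(samples, key=score_fn, reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return scored[:k]

def toy_feature_score(sample):
    # Hypothetical feature-based proxy: longer instructions that reference
    # an image are treated as more informative.
    return len(sample["instruction"].split()) + (5 if sample["has_image"] else 0)

pool = [
    {"instruction": "Describe the scene in the image in detail.", "has_image": True},
    {"instruction": "Hi.", "has_image": False},
    {"instruction": "List every object visible and their colors.", "has_image": True},
    {"instruction": "What is 2 + 2?", "has_image": False},
    {"instruction": "Explain the relationship between the two people shown.", "has_image": True},
]

selected = select_top_fraction(pool, toy_feature_score, keep_ratio=0.4)
for s in selected:
    print(s["instruction"])
```

Hybrid paradigms, in this framing, would combine several such score functions (e.g., a weighted sum) before the same filtering step.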
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Surveys
Languages Studied: English
Submission Number: 733