Keywords: Visual Instruction Tuning, Data Selection
Abstract: Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve task generalization and problem-solving capabilities by learning from a mixture of instruction data drawn from diverse visual tasks. Previous work mostly combines multiple existing visual instruction datasets heuristically for training (sometimes more than a million instructions), which may introduce data redundancy and increase the training cost. To investigate this issue, we conduct a series of empirical studies, which reveal significant redundancy within visual instruction datasets and show that greatly reducing the number of instructions from several tasks does not even affect performance. Based on these findings, we propose a high-value data selection approach, $\textbf{TIVE}$, to eliminate redundancy within visual instruction data and reduce the training cost. In TIVE, we first estimate each instance's influence score on its corresponding task, as well as the task difficulty score, using gradient-based influence functions. We then use the two kinds of scores to determine the task proportions within the selected visual instruction subset and to select high-value instances for each task. Experiments on various LVLMs show that our approach, using only about 15% of the data, achieves average performance comparable to the full-data fine-tuned model across eight benchmarks, even surpassing it on four of them. Our code and data will be publicly released.
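Illustrative sketch (not the authors' implementation): the abstract states that the task-level and instance-level scores decide the task proportions and which instances are kept, but it does not give the exact rule. The Python snippet below assumes the scores are already computed, and it assumes two simple rules that are not confirmed by the abstract: each task's quota is proportional to its difficulty score, and within a task the instances with the highest influence scores are kept. The function name select_subset and its parameters are hypothetical.

```python
from typing import Dict, List


def select_subset(
    task_difficulty: Dict[str, float],
    instance_influence: Dict[str, List[float]],
    budget: int,
) -> Dict[str, List[int]]:
    """Return, for each task, the indices of the instances to keep.

    Assumptions (not taken from the paper): quotas are proportional to
    task difficulty, and instances are ranked by influence score.
    """
    total = sum(task_difficulty.values())
    if total <= 0:
        raise ValueError("task difficulty scores must sum to a positive value")

    selected: Dict[str, List[int]] = {}
    for task, scores in instance_influence.items():
        # Assumed rule: harder tasks receive a larger share of the budget.
        # Rounding means the quotas may not sum exactly to the budget.
        quota = min(len(scores), round(budget * task_difficulty[task] / total))
        # Assumed rule: keep the instances with the highest influence scores.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        selected[task] = sorted(ranked[:quota])
    return selected


if __name__ == "__main__":
    difficulty = {"vqa": 3.0, "caption": 1.0}
    influence = {
        "vqa": [0.1, 0.9, 0.5, 0.7],
        "caption": [0.2, 0.8, 0.4, 0.6],
    }
    print(select_subset(difficulty, influence, budget=4))
```

With the toy scores above, the harder task (vqa) receives three of the four slots and the easier task receives one. The actual scoring and allocation rules used by TIVE may differ from this proportional top-k scheme.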
Supplementary Material:  zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2127