Keywords: Data selection, Large Vision Language Models, Cross-modality alignment
TL;DR: We propose a novel data selection method for fine-tuning large vision-language models, with theoretical guarantees.
Abstract: Data-efficient learning aims to eliminate redundancy in large training datasets by training models on smaller subsets of the most informative examples. While data selection has been extensively explored for vision models and large language models (LLMs), it remains underexplored for Large Vision-Language Models (LVLMs). Notably, none of the existing methods can outperform random selection across different subset sizes. In this work, we propose the first principled method for data-efficient instruction tuning of LVLMs. We prove that examples with similar cross-modal attention matrices during instruction tuning have similar gradients. Thus, they influence model parameters in a similar manner and convey the same information to the model during training. Building on this insight, we propose XMAS, which clusters examples based on the trajectories of the top singular values of their attention matrices, obtained while fine-tuning a small proxy LVLM. By sampling a balanced subset from these clusters, XMAS effectively removes redundancy in large-scale LVLM training data. Extensive experiments show that XMAS can discard 50% of the LLaVA-665k dataset and 85% of the Vision-Flan dataset while fully preserving the performance of LLaVA-1.5-7B on 10 downstream benchmarks and speeding up its training by 1.2×. This is 30% more data reduction than the best baseline achieves on LLaVA-665k.
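The selection pipeline the abstract describes (record cross-modal attention during proxy fine-tuning, featurize each example by the trajectory of its top singular values, cluster, then sample a balanced subset) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the attention matrices here are random stand-ins, and the cluster count, singular-value count, and sampling budget are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def top_singular_values(attn: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k singular values of one cross-modal attention matrix."""
    s = np.linalg.svd(attn, compute_uv=False)
    return s[:k]

# Toy stand-in for attention recorded over proxy fine-tuning:
# attn_history[i][t] is example i's attention matrix at checkpoint t.
n_examples, n_steps = 200, 4
attn_history = [[rng.random((16, 32)) for _ in range(n_steps)]
                for _ in range(n_examples)]

# Each example's feature vector is its singular-value trajectory, flattened.
features = np.stack([
    np.concatenate([top_singular_values(a) for a in steps])
    for steps in attn_history
])

# Cluster the trajectories, then draw a balanced sample across clusters.
n_clusters, budget = 10, 100
labels = KMeans(n_clusters=n_clusters, n_init=10,
                random_state=0).fit_predict(features)

per_cluster = budget // n_clusters
selected = np.concatenate([
    rng.choice(np.where(labels == c)[0],
               size=min(per_cluster, int((labels == c).sum())),
               replace=False)
    for c in range(n_clusters)
])
print(len(selected))  # at most `budget` examples, spread across clusters
```

The balanced draw caps each cluster's contribution at `budget // n_clusters`, which is one simple way to realize the paper's "balanced subset"; clusters smaller than the cap contribute all of their members.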
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13952