Filter Images First, Generate Instructions Later: Pre-Instruction Data Selection for Visual Instruction Tuning
Abstract: Visual instruction tuning (VIT) for large vision-language
models (LVLMs) requires training on expansive datasets of
image-instruction pairs, which can be costly. Recent efforts
in VIT data selection aim to select a small subset of high-quality image-instruction pairs, reducing VIT runtime while
maintaining performance comparable to full-scale training.
However, a major challenge often overlooked is that generating instructions from unlabeled images
for VIT is highly
expensive. Most existing VIT datasets rely heavily on human annotations or paid services like the GPT API, which
limits users with constrained resources from creating VIT
datasets for custom applications. To address this, we introduce Pre-Instruction Data Selection (PreSel), a more
practical data selection paradigm that directly selects the
most beneficial unlabeled images and generates instructions
only for the selected images. PreSel first estimates the
relative importance of each vision task within VIT datasets
to derive task-wise sampling budgets. It then clusters image
features within each task, selecting the most representative
images with the budget. This approach reduces computational overhead for both instruction generation during VIT
data formation and LVLM fine-tuning. By generating instructions for only 15% of the images, PreSel achieves
performance comparable to full-data VIT on the LLaVA-1.5 and Vision-Flan datasets.
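
The two-stage selection described above (task-wise budgets, then per-task clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how task importance is estimated, so the sketch takes the importance scores as a given input, splits the budget proportionally, runs a plain k-means, and treats the image nearest to each centroid as "most representative". The function names (`allocate_budgets`, `kmeans`, `select_representatives`, `pre_instruction_select`) are hypothetical.

```python
import numpy as np


def allocate_budgets(importance, total_budget, task_sizes):
    """Split total_budget across tasks in proportion to importance.

    Assumption: importance scores are supplied by the caller; how PreSel
    estimates them is not specified in the abstract.
    """
    weights = np.asarray(importance, dtype=float)
    weights = weights / weights.sum()
    budgets = np.floor(weights * total_budget).astype(int)
    # Never request more images than a task actually contains.
    return np.minimum(budgets, np.asarray(task_sizes))


def kmeans(features, k, iters=20, seed=0):
    """Plain Lloyd's k-means on an (n, d) array; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Fancy indexing copies, so updating centroids leaves features untouched.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(
            features[:, None, :] - centroids[None, :, :], axis=2
        )
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = features[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, labels


def select_representatives(features, budget, seed=0):
    """Pick `budget` distinct image indices, one per cluster centroid.

    Assumption: "representative" is approximated as nearest-to-centroid.
    """
    if budget <= 0:
        return []
    centroids, _ = kmeans(features, budget, seed=seed)
    dists = np.linalg.norm(
        features[:, None, :] - centroids[None, :, :], axis=2
    )
    chosen = []
    for c in range(budget):
        order = np.argsort(dists[:, c])
        # Take the nearest image that has not been selected already.
        pick = next(int(i) for i in order if int(i) not in chosen)
        chosen.append(pick)
    return chosen


def pre_instruction_select(task_features, importance, total_budget):
    """Return, for each task, the indices of images to send for instruction generation."""
    sizes = [len(f) for f in task_features]
    budgets = allocate_budgets(importance, total_budget, sizes)
    return [
        select_representatives(f, int(b))
        for f, b in zip(task_features, budgets)
    ]
```

Calling `pre_instruction_select` with one feature array per task, for example image embeddings from a frozen vision encoder, returns one index list per task. Instructions would then be generated only for those images, which is where the savings in annotation or API cost would come from.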