Keywords: data selection, multi-modal, MLLMs, supervised finetuning
Abstract: The hypothesis that pretrained large language models (LLMs) require only limited supervision during the supervised fine-tuning (SFT) stage has been substantiated by recent advances in data curation and selection research. However, these methods remain sensitive to experimental setups and validation protocols, which undermines their stability and generalizability and often leaves them unable to surpass random sampling. Multi-modal LLMs (MLLMs), built upon LLMs and facing far larger token volumes and more heterogeneous data sources, amplify both the significance and the complexity of data selection. To harvest multi-modal instructional data in a robust, efficient and transferable manner, we re-define the granularity of the quality metric by decomposing it into 14 interpretable vision-language capabilities, and introduce multi-modal rich scorers to evaluate the corresponding value of each sample. In light of the inherent objective of the instructional stage, we take interactive styles as a superficial diversity indicator and use a multi-modal rich styler to partition candidate data. In doing so, our \textbf{m}ulti-\textbf{m}odal \textbf{r}ich \textbf{s}corers and \textbf{s}tyler (mmSSR) guarantee that high-scoring information is delivered to users in diversified forms. Free from embedding-based clustering or greedy sampling, mmSSR efficiently scales to millions of samples under varying budget constraints, supports general and specific capability customization, and enables training-free transfer to new domains for curation. Across 10+ experimental settings, validated on 14 multi-modal benchmarks, we demonstrate consistent improvements over random sampling, baseline strategies and state-of-the-art selection methods, achieving 99.1% of full performance using only 30% of the 2.6M data.
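The selection scheme sketched in the abstract (capability scoring plus style-based partitioning, with no embedding clustering or greedy sampling) can be illustrated with a minimal sketch. The field names, the aggregation of the 14 capability scores, and the proportional budget split below are assumptions for illustration only; the paper's actual scorers, styler and allocation rule may differ.

```python
# Hypothetical sketch of score-and-style-based selection as described in the
# abstract; field names, score aggregation and the budget allocation rule
# are illustrative assumptions, not the authors' implementation.
from collections import defaultdict

def select_subset(samples, budget):
    """samples: list of dicts with precomputed fields
         'scores': dict mapping each of the 14 capability names to a float
         'style' : interaction-style label predicted by the styler
       budget: total number of samples to keep."""
    # Group candidates by interaction style (the diversity indicator).
    by_style = defaultdict(list)
    for s in samples:
        by_style[s["style"]].append(s)

    # Allocate the budget across styles proportionally to group size
    # (one simple choice; other allocation rules are possible).
    selected = []
    for group in by_style.values():
        quota = max(1, round(budget * len(group) / len(samples)))
        # Rank within each style by an aggregate capability score
        # (here: a plain sum over the 14 capability scores).
        group.sort(key=lambda s: sum(s["scores"].values()), reverse=True)
        selected.extend(group[:quota])

    # Trim to the exact budget by overall score in case rounding overshoots.
    selected.sort(key=lambda s: sum(s["scores"].values()), reverse=True)
    return selected[:budget]
```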
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 567