Abstract: Large-scale multi-modal data has driven major advances in multi-modal large language models (MLLMs). However, how to discern high-quality and diverse multi-modal training data remains underexplored.
In this work, we introduce DOSE, a multi-modal data selection method that identifies a compact data subset for efficient model training, driven by a distribution over quality scores. The key idea is to leverage existing models that have not seen the data to be filtered to evaluate image-text relevance and text quality, and to construct combined score distributions for rejection sampling, thereby identifying high-quality samples that optimize model performance and training efficiency.
Extensive experiments demonstrate that our method acquires a high-quality data subset that maintains model performance while improving training efficiency. For example, models trained on our selected data (40% of LLaVA-665K and 20% of MathV360K) achieve the same performance as the base models trained on the full datasets. Moreover, when trained on a larger subset, performance can even exceed that of using the full dataset.
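To make the selection idea concrete, below is a minimal sketch of quality-score-driven rejection sampling. It is not the authors' implementation: the scorer inputs, the multiplicative combination of scores, and the `select_subset`/`keep_ratio` names are all illustrative assumptions; in practice the relevance and text-quality scores would come from pretrained models that have not seen the candidate data.

```python
# Minimal sketch of distribution-driven rejection sampling for data selection.
# Assumes precomputed per-sample scores; all names and the combination rule
# are hypothetical, not the paper's actual method.
import numpy as np

rng = np.random.default_rng(0)

def select_subset(relevance, text_quality, keep_ratio=0.4):
    """Keep roughly `keep_ratio` of samples, favoring high combined scores.

    relevance, text_quality: arrays of per-sample scores in [0, 1],
    e.g. image-text relevance and text quality from external scorers.
    """
    combined = relevance * text_quality           # assumed combination rule
    # Acceptance probability proportional to the combined score, scaled so
    # the expected kept fraction is approximately keep_ratio.
    probs = np.clip(combined / combined.mean() * keep_ratio, 0.0, 1.0)
    accepted = rng.random(len(probs)) < probs     # rejection sampling step
    return np.flatnonzero(accepted)

# Toy usage: 1,000 samples with random scores; select roughly 40%.
rel = rng.random(1000)
qual = rng.random(1000)
idx = select_subset(rel, qual, keep_ratio=0.4)
print(f"kept {len(idx)} of 1000 samples")
```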
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodal, Data Selection
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 3236