Abstract: Large-scale multi-modal data has driven major advances in multi-modal large language models (MLLMs). However, how to discern high-quality and diverse multi-modal training data remains underexplored.
In this work, we introduce DOSE, a multi-modal data selection method that identifies a compact data subset for efficient model training, driven by a distribution over quality scores. The key idea is to leverage existing models that have not seen the data to be filtered to evaluate image-text relevance and text quality, and to construct combined score distributions for rejection sampling, thereby identifying high-quality samples that optimize model performance and training efficiency.
Extensive experiments demonstrate that our method acquires a high-quality data subset that maintains model performance while improving training efficiency. For example, models trained on our selected data (40% of LLaVA-665K and 20% of MathV360K) achieve the same performance as the base models trained on the full datasets. Moreover, when trained on a larger subset, performance can even exceed that of using the full dataset.
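To make the selection idea concrete, below is a minimal sketch of quality-score-driven rejection sampling. It is not the authors' implementation: the scorer inputs, the multiplicative combination of scores, and the `select_subset`/`keep_ratio` names are all illustrative assumptions; in practice the relevance and text-quality scores would come from pretrained models that have not seen the candidate data.

```python
# Minimal sketch of distribution-driven rejection sampling for data selection.
# Assumes precomputed per-sample scores; all names and the combination rule
# are hypothetical, not the paper's actual method.
import numpy as np

rng = np.random.default_rng(0)

def select_subset(relevance, text_quality, keep_ratio=0.4):
    """Keep roughly `keep_ratio` of samples, favoring high combined scores.

    relevance, text_quality: arrays of per-sample scores in [0, 1],
    e.g. image-text relevance and text quality from external scorers.
    """
    combined = relevance * text_quality           # assumed combination rule
    # Acceptance probability proportional to the combined score, scaled so
    # the expected kept fraction is approximately keep_ratio.
    probs = np.clip(combined / combined.mean() * keep_ratio, 0.0, 1.0)
    accepted = rng.random(len(probs)) < probs     # rejection sampling step
    return np.flatnonzero(accepted)

# Toy usage: 1,000 samples with random scores; select roughly 40%.
rel = rng.random(1000)
qual = rng.random(1000)
idx = select_subset(rel, qual, keep_ratio=0.4)
print(f"kept {len(idx)} of 1000 samples")
```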
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodal, Data Selection
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: English
Submission Number: 3236