DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

ACL ARR 2025 July Submission 1051 Authors

29 Jul 2025 (modified: 01 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large-scale multi-modal data has driven major advances in multi-modal large language models (MLLMs). However, how to discern high-quality and diverse multi-modal training data remains underexplored. In this work, we introduce DOSE, a multi-modal data selection method that identifies a compact data subset for efficient model training, driven by a data distribution based on quality scores. The key idea is to leverage existing models, which have not seen the data to be filtered, to evaluate image-text relevance and text quality, and to construct combined distributions for rejection sampling, thereby identifying high-quality samples that optimize model performance and training efficiency. Extensive experiments demonstrate that our method acquires a high-quality data subset that maintains model performance while improving training efficiency. For example, models trained on our selected data (40% of LLaVA-665K and 20% of MathV360K) achieve the same performance as base models trained on the full datasets. Furthermore, when trained on a larger subset, our models can even exceed the performance obtained with the full dataset.
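The selection procedure sketched in the abstract (combine per-sample relevance and quality scores, then rejection-sample a subset whose acceptance probability follows the combined distribution) could be illustrated as follows. This is a minimal, hypothetical sketch, not the authors' implementation: the function name, the multiplicative score combination, and the assumption that both scores are pre-normalized to [0, 1] are illustrative choices.

```python
import random

def select_subset(samples, relevance_scores, quality_scores,
                  keep_ratio=0.4, seed=0):
    """Illustrative rejection sampling over a combined score distribution.

    `samples`, `relevance_scores`, and `quality_scores` are parallel lists;
    scores are assumed to lie in [0, 1] (e.g. produced by off-the-shelf
    scoring models). Each candidate is accepted with probability
    proportional to its combined score, so higher-quality, more relevant
    samples are more likely to survive into the subset.
    """
    rng = random.Random(seed)
    combined = [r * q for r, q in zip(relevance_scores, quality_scores)]
    max_score = max(combined)
    target = int(len(samples) * keep_ratio)

    selected, pool = [], list(range(len(samples)))
    while len(selected) < target and pool:
        i = pool[rng.randrange(len(pool))]
        # Rejection step: accept with probability combined[i] / max_score.
        if rng.random() < combined[i] / max_score:
            selected.append(samples[i])
            pool.remove(i)
    return selected
```

With `keep_ratio=0.4`, this mirrors the paper's setting of keeping 40% of a corpus such as LLaVA-665K, though the actual scoring models and the exact form of the combined distribution are described in the paper itself.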
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodal, Data Selection
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=xdaTDIxBa5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Page 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sec 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: We ensure that all existing artifacts are used in accordance with their intended use as specified by their licenses or documentation.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Sec 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Sec 4.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Sec 4.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Sec 4.2 and Sec 4.3
C4 Parameters For Packages: Yes
C4 Elaboration: Sec 4.1
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: no
Submission Number: 1051