DOSE: Data Selection for Multi-Modal LLMs via Off-the-Shelf Models

ACL ARR 2025 July Submission 1051 Authors

29 Jul 2025 (modified: 01 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large-scale multi-modal data has driven major advances in multi-modal large language models (MLLMs). However, how to discern high-quality and diverse multi-modal training data remains underexplored. In this work, we introduce DOSE, a multi-modal data selection method that identifies a compact data subset for efficient model training, driven by a data distribution based on quality scores. The key idea is to leverage existing models, which have not seen the data to be filtered, to evaluate image-text relevance and text quality, and to construct combined distributions for rejection sampling, thereby identifying high-quality samples that optimize model performance and training efficiency. Extensive experiments demonstrate that our method acquires a high-quality data subset that maintains model performance while improving training efficiency. For example, models trained on our selected data (40% of LLaVA-665K and 20% of MathV360K) achieve the same performance as base models trained on the full datasets. Furthermore, when trained on a larger subset, our models can even exceed the performance obtained with the full dataset.
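The selection procedure sketched in the abstract (combine per-sample relevance and quality scores, then rejection-sample a subset whose acceptance probability follows the combined distribution) could be illustrated as follows. This is a minimal, hypothetical sketch, not the authors' implementation: the function name, the multiplicative score combination, and the assumption that both scores are pre-normalized to [0, 1] are illustrative choices.

```python
import random

def select_subset(samples, relevance_scores, quality_scores,
                  keep_ratio=0.4, seed=0):
    """Illustrative rejection sampling over a combined score distribution.

    `samples`, `relevance_scores`, and `quality_scores` are parallel lists;
    scores are assumed to lie in [0, 1] (e.g. produced by off-the-shelf
    scoring models). Each candidate is accepted with probability
    proportional to its combined score, so higher-quality, more relevant
    samples are more likely to survive into the subset.
    """
    rng = random.Random(seed)
    combined = [r * q for r, q in zip(relevance_scores, quality_scores)]
    max_score = max(combined)
    target = int(len(samples) * keep_ratio)

    selected, pool = [], list(range(len(samples)))
    while len(selected) < target and pool:
        i = pool[rng.randrange(len(pool))]
        # Rejection step: accept with probability combined[i] / max_score.
        if rng.random() < combined[i] / max_score:
            selected.append(samples[i])
            pool.remove(i)
    return selected
```

With `keep_ratio=0.4`, this mirrors the paper's setting of keeping 40% of a corpus such as LLaVA-665K, though the actual scoring models and the exact form of the combined distribution are described in the paper itself.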
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Multimodal, Data Selection
Contribution Types: Approaches to low-resource settings, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=xdaTDIxBa5
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Page 9
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sec 4
B2 Discuss The License For Artifacts: N/A
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: We ensure that all existing artifacts are used in accordance with their intended use as specified by their licenses or documentation.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Sec 4.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Sec 4.1
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Sec 4.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Sec 4.2 and Sec 4.3
C4 Parameters For Packages: Yes
C4 Elaboration: Sec 4.1
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: no
Submission Number: 1051