Select the Key, Then Generate the Rest: Improving Multi-Modal Learning with Limited Data Budget

ICLR 2026 Conference Submission 14374 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multimodal learning, Data-efficient Learning
TL;DR: We demonstrate that multimodal models can learn effectively, even surpassing full-data training, by strategically selecting only key modalities to collect data for and generating the rest.
Abstract: Multimodal learning is a promising approach for applications with diverse information sources. However, scaling up multimodal training data is challenging: preparing new data for all modalities is often infeasible due to limited availability or the varied cost of data collection. In this paper, we are the first to demonstrate that multimodal models trained with only a subset of modalities available for new data can match and even surpass models continuously trained with all modalities. We formulate the research problem as: given a limited data collection budget, which modalities should we collect new data for, and generate for the rest, to maximize the model's performance gain? To answer this, we propose a new paradigm: Select the Key modality, then generate the rest to enable learning with limited data (SK-ll). SK-ll contains two key components. (1) Select the key: we propose a modality importance indicator that finds the optimal modalities by assessing their single-modal marginal contributions and cross-modal interactions. (2) Generate the rest: for the remaining modalities, we substitute generated embeddings. We conduct extensive experiments across affective computing, healthcare, and various vision-language tasks with diverse multimodal learning backbones to demonstrate the effectiveness of SK-ll. We also present empirical insights, for example on data efficiency.
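As a rough illustration of the "select the key" step, the sketch below scores each modality by a single-modal marginal-contribution term plus a simple pairwise cross-modal interaction proxy. It assumes access to a hypothetical oracle `perf(subset)` that returns a validation metric for a model trained on that subset of modalities; the function name, the interaction proxy, and all identifiers are illustrative assumptions, not the paper's actual indicator.

```python
def modality_importance(perf, modalities):
    """Score each modality by marginal contribution plus pairwise interactions.

    perf: callable mapping a frozenset of modality names to a validation
          metric (higher is better). Hypothetical stand-in for training and
          evaluating a model on that modality subset.
    modalities: iterable of modality names, e.g. ["text", "audio", "video"].
    """
    mods = list(modalities)
    full = frozenset(mods)
    scores = {}
    for m in mods:
        # Single-modal marginal contribution: performance drop when m is removed.
        marginal = perf(full) - perf(full - {m})
        # Pairwise cross-modal interaction: gain of each pair beyond the sum
        # of its members' individual contributions (a second-order proxy).
        interaction = sum(
            perf(frozenset({m, o})) - perf(frozenset({m})) - perf(frozenset({o}))
            for o in mods if o != m
        )
        scores[m] = marginal + interaction
    return scores

# Hypothetical usage: collect new data only for the top-scoring modality,
# then substitute generated embeddings for the rest.
# scores = modality_importance(perf, ["text", "audio", "video"])
# key = max(scores, key=scores.get)
```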
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 14374