Keywords: Active Learning, Multimodal Learning, Contrastive Multimodal Models
Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on single-modality data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for $\textit{multimodal active learning with unaligned data}$, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines such as CLIP and SigLIP, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to 40% without loss in accuracy.
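The paper's algorithm is not specified in the abstract, but the general idea of combining uncertainty and diversity for acquiring cross-modal alignments can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the function name `acquire`, the margin-based uncertainty score, and the bucket-based round-robin diversity step are hypothetical stand-ins, not the authors' method. The sketch scores each unaligned image by the margin between its best and second-best candidate text (low margin = high alignment uncertainty), then spreads selections across distinct top-candidate texts so the acquired pairs are diverse.

```python
import numpy as np

def acquire(img_emb, txt_emb, k):
    """Hypothetical uncertainty+diversity acquisition for unaligned pairs.

    img_emb: (n_images, d) unimodal image embeddings
    txt_emb: (n_texts, d) unimodal text embeddings
    Returns indices of k images whose alignment should be annotated.
    """
    assert 0 < k <= len(img_emb)
    # Cosine similarity between every image and every candidate text.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    # Uncertainty: small margin between best and second-best candidate
    # alignment means the model is unsure which text matches.
    top2 = np.partition(sims, -2, axis=1)
    uncertainty = -(top2[:, -1] - top2[:, -2])
    # Diversity (modality-aware): bucket images by their nearest text,
    # then round-robin across buckets so selections are not all
    # competing for the same candidate alignment.
    buckets = {}
    for i in np.argsort(-uncertainty):  # most uncertain first
        buckets.setdefault(int(sims[i].argmax()), []).append(int(i))
    selected = []
    while len(selected) < k:
        for bucket in buckets.values():
            if bucket and len(selected) < k:
                selected.append(bucket.pop(0))
    return selected
```

Both scoring passes are linear in the pool size for a fixed candidate set, consistent with the linear-time acquisition the abstract claims, though the actual mechanism in the paper may differ.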
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 13601