Towards Multimodal Active Learning: Efficient Learning with Limited Paired Data

Published: 23 Feb 2026, Last Modified: 23 Feb 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: Active learning (AL) is a principled strategy to reduce annotation cost in data-hungry deep learning. However, existing AL algorithms focus almost exclusively on unimodal data, overlooking the substantial annotation burden in multimodal learning. We introduce the first framework for multimodal active learning with unaligned data, where the learner must actively acquire cross-modal alignments rather than labels on pre-aligned pairs. This setting captures the practical bottleneck in modern multimodal pipelines, where unimodal features are easy to obtain but high-quality alignment is costly. We develop a new algorithm that combines uncertainty and diversity principles in a modality-aware design, achieves linear-time acquisition, and applies seamlessly to both pool-based and streaming-based settings. Extensive experiments on benchmark datasets demonstrate that our approach consistently reduces multimodal annotation cost while preserving performance; for instance, on the ColorSwap dataset it cuts annotation requirements by up to 40% without loss in accuracy.
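The abstract describes an acquisition rule that combines uncertainty and diversity in a modality-aware way for selecting which cross-modal pairs to align. As an illustration only, and not the paper's actual algorithm, the following minimal sketch scores image-side items by an alignment-uncertainty margin (gap between their two most similar text candidates under cosine similarity) and then diversifies the selection with greedy farthest-point sampling; the function name `select_queries` and the specific scoring choices are assumptions for this sketch.

```python
import numpy as np

def select_queries(img_emb, txt_emb, k):
    """Hypothetical sketch of uncertainty + diversity acquisition.

    Scores each image-side item by the margin between its two highest
    cosine similarities to text candidates (a small margin means the
    alignment is ambiguous), then greedily picks a diverse subset of
    the most uncertain candidates via farthest-point sampling.
    """
    # Normalize embeddings so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T                         # (n_img, n_txt)
    top2 = np.sort(sims, axis=1)[:, -2:]       # two best text matches per image
    uncertainty = -(top2[:, 1] - top2[:, 0])   # small margin => high uncertainty
    # Keep the 2k most uncertain candidates, then diversify among them.
    cand = np.argsort(uncertainty)[-2 * k:]
    chosen = [cand[-1]]                        # start from the most uncertain
    while len(chosen) < k:
        # Cosine distance from each candidate to its nearest chosen item;
        # already-chosen items have distance 0 and are never re-picked.
        d = np.min(1 - img[cand] @ img[chosen].T, axis=1)
        chosen.append(cand[int(np.argmax(d))])
    return chosen
```

Each acquisition round here costs one similarity pass plus a greedy loop, which is how a sketch like this stays close to linear in the pool size for fixed `k`.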
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have incorporated the relevant discussion points and additional results from the rebuttal into the revised manuscript, including:
- Section 6: empirical data acquisition runtime comparison; evaluation under different metrics; and robustness to missing data and alignments.
- Appendix C: streaming active learning with different streaming buffer sizes; learning with different distance metrics; and comparison against additional adapted uniform AL baselines.
We also made light edits throughout for clarity and readability.
Assigned Action Editor: ~Christopher_Mutschler1
Submission Number: 6508