Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

ICLR 2026 Conference Submission 14635 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Unpaired Multimodal Representation Learning, Cross-modal Learning, Multimodal Learning from Unpaired Data
TL;DR: We show that incorporating auxiliary unpaired multimodal data can significantly improve performance on an individual modality.
Abstract: Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on large paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary $\textit{unpaired}$ multimodal data to directly enhance representation learning in a $\textit{target}$ modality? We introduce $\textbf{UML}$: $\textbf{U}$npaired $\textbf{M}$ultimodal $\textbf{L}$earner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the world than those learned from unimodal training alone. Empirically, we show that incorporating unpaired data from auxiliary modalities (such as text, audio, or images) consistently improves downstream performance on diverse unimodal targets such as images and audio.
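The training paradigm the abstract describes, a single model that alternates over unpaired batches from different modalities while sharing parameters across them, can be illustrated with a minimal sketch. The code below is a hypothetical illustration under assumed details and is not the authors' implementation: the module names, feature dimensions, per-modality heads, and classification losses are all assumptions made for the example.

```python
# Hypothetical sketch (not the paper's code) of alternating, parameter-sharing training
# on unpaired data: a shared backbone is updated by batches from a target modality
# (here "image") and an auxiliary modality (here "text") in turn, with no paired examples.
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Parameters shared across all modalities.
        self.trunk = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.trunk(x)

class UnpairedMultimodalLearner(nn.Module):
    def __init__(self, image_dim=512, text_dim=300, dim=256, num_classes=10):
        super().__init__()
        # Lightweight modality-specific projections into a common feature space.
        self.image_proj = nn.Linear(image_dim, dim)
        self.text_proj = nn.Linear(text_dim, dim)
        self.backbone = SharedBackbone(dim)             # shared across modalities
        self.image_head = nn.Linear(dim, num_classes)   # target-modality task head
        self.text_head = nn.Linear(dim, num_classes)    # auxiliary-modality task head

    def forward(self, x, modality):
        proj = self.image_proj if modality == "image" else self.text_proj
        head = self.image_head if modality == "image" else self.text_head
        return head(self.backbone(proj(x)))

def train_step(model, optimizer, criterion, batch, modality):
    x, y = batch
    optimizer.zero_grad()
    loss = criterion(model(x, modality), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Alternate between unpaired batches from each modality; no (image, text) pairs required.
model = UnpairedMultimodalLearner()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
image_batch = (torch.randn(32, 512), torch.randint(0, 10, (32,)))  # dummy unpaired data
text_batch = (torch.randn(32, 300), torch.randint(0, 10, (32,)))
for step in range(100):
    train_step(model, optimizer, criterion, image_batch, "image")
    train_step(model, optimizer, criterion, text_batch, "text")
```

In this sketch only the thin input projections and task heads are modality-specific; the shared backbone receives gradients from every modality, which is the mechanism by which unpaired auxiliary data could shape the target-modality representation.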
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14635