Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models

Published: 23 Sept 2025, Last Modified: 17 Nov 2025 · UniReps 2025 · CC BY 4.0
Supplementary Material: pdf
Track: Extended Abstract Track
Keywords: Unpaired Multimodal Representation Learning, Cross-modal Learning, Multimodal Learning from Unpaired Data
TL;DR: We show that incorporating auxiliary unpaired multimodal data can significantly improve performance on an individual modality.
Abstract: Traditional multimodal frameworks emphasize learning unified representations for tasks such as visual question answering, typically requiring paired, aligned data. However, an overlooked yet powerful question remains: can one leverage auxiliary *unpaired* multimodal data to directly enhance representation learning in an *individual* modality? To explore this, we propose **UML**: **U**npaired **M**ultimodal **L**earner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities (images, text, audio, or video) while sharing model weights across these modalities. Our approach exploits shared structure in unaligned multimodal signals, eliminating the need for paired data. We show that unpaired text improves image classification, and that other auxiliary modalities likewise enhance both image and audio tasks.
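
The training paradigm described in the abstract can be illustrated with a minimal sketch: modality-specific input projections feed a shared trunk, and training alternates over unpaired batches from each modality. This is not the authors' released implementation; the class name, the encoder/trunk/head structure, the dimensions, and the loss choice are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn


class UnpairedMultimodalLearner(nn.Module):
    """Illustrative sketch of a UML-style model (hypothetical architecture).

    Per-modality encoders project inputs into a common space; the trunk
    weights are shared across all modalities, as described in the abstract.
    """

    def __init__(self, dims, hidden=256, num_classes=10):
        super().__init__()
        # Modality-specific input projections (e.g. image, text, audio features).
        self.encoders = nn.ModuleDict(
            {name: nn.Linear(d, hidden) for name, d in dims.items()}
        )
        # These weights are shared across every modality.
        self.shared = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        self.heads = nn.ModuleDict(
            {name: nn.Linear(hidden, num_classes) for name in dims}
        )

    def forward(self, x, modality):
        h = self.encoders[modality](x)
        h = self.shared(h)  # same parameters regardless of modality
        return self.heads[modality](h)


def train_step(model, optimizer, batches):
    """Alternate over unpaired batches, one modality at a time."""
    criterion = nn.CrossEntropyLoss()
    for modality, (x, y) in batches.items():
        optimizer.zero_grad()
        loss = criterion(model(x, modality), y)
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    # Toy unpaired data: image and text feature batches with unrelated samples.
    model = UnpairedMultimodalLearner({"image": 512, "text": 300})
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    batches = {
        "image": (torch.randn(32, 512), torch.randint(0, 10, (32,))),
        "text": (torch.randn(32, 300), torch.randint(0, 10, (32,))),
    }
    train_step(model, opt, batches)
```

The key design point the sketch is meant to convey is that no image-text pairing is required: each batch carries only its own labels, and the shared trunk is the sole channel through which the auxiliary modality influences the target modality's representation.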
Submission Number: 31