Abstract: Multimodal learning typically requires expensive paired data for training and assumes all modalities are available at inference. Many real-world scenarios, however, involve unpaired and heterogeneous data distributed across institutions, making collaboration challenging. We introduce Unpaired Multimodal Learning (UML) as the problem of leveraging semantically related but unaligned data across modalities, without requiring explicit pairing or multimodal inference. This setting naturally arises in collaborative scenarios such as satellite imagery, where institutions collect data from diverse sensors (optical, multispectral, SAR), but paired acquisitions are rare, and data sharing is restricted. We propose a collaborative framework that combines modality-specific projections with a shared backbone, enabling cross-modal knowledge transfer without paired samples. A key element is post-hoc batch normalization calibration, which adapts the shared model to each modality. Our framework also extends naturally to federated training across institutions. Experiments on multiple satellite benchmarks and additional visual datasets show consistent improvements over unimodal baselines, with particularly strong gains for weaker modalities and in low-data regimes.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jiangchao_Yao1
Submission Number: 9062
Loading