Keywords: perfect alignment, multimodal learning, signal processing, cross-modal transfer
TL;DR: We cast contrastive alignment as a signal-processing inverse problem and show that a perfect alignment can be recovered under certain conditions.
Abstract: Multimodal alignment aims to construct a joint latent vector space where two modalities representing the same concept map to the same vector. We formulate this as an inverse problem and show that, under certain conditions, perfect alignment can be achieved.
When perfect alignment cannot be achieved, it can be approximated using the Singular Value Decomposition (SVD) of a multimodal data matrix. Experiments on synthetic multimodal Gaussian data verify the effectiveness of our perfect alignment method compared to the popular contrastive alignment method. We discuss how these findings can be applied to visual data and sensor data for unsupervised cross-modal transfer. We hope these findings inspire further exploration of the applications of perfect alignment for cross-modal learning.
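The SVD-based idea can be illustrated with a minimal sketch on synthetic Gaussian data. This is one plausible reading of the abstract, not the authors' implementation: it assumes both modalities are noiseless linear mixtures of a shared latent, that alignment means finding linear maps `P` and `Q` with `P @ X == Q @ Y`, and that these maps can be read off the left singular vectors of the stacked data matrix with the smallest singular values. The function name `svd_alignment`, the mixing matrices `A` and `B`, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np

def svd_alignment(X, Y, k):
    """Hypothetical helper: find P, Q such that P @ X is close to Q @ Y.

    Assumption: [P, -Q] spans the k left singular directions of the
    stacked matrix [X; Y] with the smallest singular values.
    """
    d1 = X.shape[0]
    M = np.vstack([X, Y])                       # (d1 + d2) x n multimodal data matrix
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    W = U[:, -k:].T                             # rows satisfy W @ M ~ 0 when aligned
    P = W[:, :d1]                               # map for modality X
    Q = -W[:, d1:]                              # map for modality Y (sign flipped)
    return P, Q, s

rng = np.random.default_rng(0)
k, n = 4, 500                                   # latent dimension, number of paired samples
Z = rng.standard_normal((k, n))                 # shared Gaussian latent (assumed)
A = rng.standard_normal((k, k))                 # assumed mixing for modality 1
B = rng.standard_normal((k, k))                 # assumed mixing for modality 2
X = A @ Z
Y = B @ Z

P, Q, s = svd_alignment(X, Y, k)

# Relative mismatch between the two projected modalities.
residual = np.linalg.norm(P @ X - Q @ Y) / np.linalg.norm(P @ X)
print("relative alignment residual:", residual)
```

In this noiseless setting the stacked matrix has rank `k`, so the residual is at the level of floating-point error. Adding noise to `X` or `Y` would make the smallest singular values nonzero, and the same singular vectors would then give only an approximate alignment in the least-squares sense.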
Submission Number: 29