Aligning Multimodal Representations through an Information Bottleneck

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Contrastive losses maximize shared information but retain modality-specific content, causing misalignment. We formalize this with information theory and propose a lightweight regularizer to improve cross-modal representation alignment.
Abstract: Contrastive losses have been extensively used as a tool for multimodal representation learning. However, it has been empirically observed that they are not effective at learning an aligned representation space. In this paper, we argue that this phenomenon is caused by the presence of modality-specific information in the representation space. Although some of the most widely used contrastive losses maximize the mutual information between representations of both modalities, they are not designed to remove the modality-specific information. We give a theoretical description of this problem through the lens of the Information Bottleneck Principle. We also empirically analyze how different hyperparameters affect the emergence of this phenomenon in a controlled experimental setup. Finally, we propose a regularization term in the loss function, derived by means of a variational approximation, that aims to increase representational alignment. We analyze the advantages of including this regularization term in a set of controlled experiments and real-world applications.
Lay Summary: Multimodal models learn from different types of data—such as images and text—by bringing their internal representations closer together using contrastive losses. While these losses are popular, they often fail to ensure that the representations from each modality (e.g., image and caption) are truly aligned. This paper investigates why this misalignment happens. The key insight is that contrastive losses, while encouraging shared information across modalities, do not remove information that is unique to each modality (like the color of a dog in an image but not mentioned in the caption). As a result, the learned representations can remain misaligned. We explain this issue using the Information Bottleneck Principle, a theoretical framework that helps understand how much relevant information a representation keeps. We also show, through controlled experiments, how certain hyperparameters influence this problem. To address it, we propose a new regularization term—based on a variational approximation—that helps remove irrelevant, modality-specific details from the representations. Our approach improves alignment both in synthetic setups and in real-world multimodal tasks.
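To make the idea concrete, below is a minimal, hypothetical sketch of how an Information Bottleneck-style compression term can be added to a CLIP-style contrastive loss: the encoders output a Gaussian posterior per modality, and a KL divergence to a standard normal prior (a common variational upper bound on I(X; Z)) penalizes modality-specific information. The function names, the `beta` weight, and the choice of prior are illustrative assumptions, not the paper's exact derivation; see the linked repository below for the authors' implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    # Symmetric InfoNCE (CLIP-style) contrastive loss between paired
    # representations of two modalities; maximizes shared information.
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ): the usual variational
    # bound used as the IB compression term.
    return 0.5 * torch.mean(torch.sum(logvar.exp() + mu.pow(2) - 1.0 - logvar, dim=-1))

def ib_regularized_loss(mu_a, logvar_a, mu_b, logvar_b, beta=1e-3):
    # Hypothetical combined objective: contrastive alignment plus a
    # compression penalty on each modality's stochastic representation.
    z_a = mu_a + torch.randn_like(mu_a) * (0.5 * logvar_a).exp()  # reparameterization
    z_b = mu_b + torch.randn_like(mu_b) * (0.5 * logvar_b).exp()
    contrastive = info_nce(z_a, z_b)
    compression = kl_to_standard_normal(mu_a, logvar_a) + kl_to_standard_normal(mu_b, logvar_b)
    return contrastive + beta * compression
```

The intent of the compression term, in the spirit of the abstract, is to discourage the encoders from keeping modality-specific details that the contrastive term alone does not remove; `beta` trades off alignment against compression.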
Link To Code: https://github.com/antonioalmudevar/multimodal_ib
Primary Area: General Machine Learning->Representation Learning
Keywords: multimodal representation learning, representational alignment, modality gap, information bottleneck
Submission Number: 11836