Track: long paper (up to 10 pages)
Domain: machine learning
Abstract: In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs, but remains unresolved in more complex multimodal settings, especially when integrating three or more modalities.
In this work, we propose a modality-agnostic framework that closes the modality gap across multiple modalities, ensuring that semantically related representations are tightly aligned regardless of their source modality. Beyond theoretical improvements, we demonstrate that closing the modality gap has significant implications for real-world applications. In semantic communication, our approach enables the transmission of a single compact representation per semantic concept, drastically reducing bandwidth requirements while preserving multimodal reconstruction quality. In medical multimodal learning, our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning. We show that our approach not only closes the modality gap but also unlocks capabilities in downstream applications that were previously limited by poor cross-modal alignment.
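The CLIP-style contrastive objective the abstract refers to, which pulls matched cross-modal pairs together and pushes mismatched pairs apart, can be sketched as a symmetric InfoNCE loss. This is a minimal NumPy illustration of the general technique, not the paper's method; all names and the temperature value are illustrative assumptions.

```python
import numpy as np

def _logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis.
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings
    (e.g. images and their captions); row i of emb_a matches row i of emb_b."""
    # L2-normalize so similarities are cosine similarities on the hypersphere.
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    # Pairwise similarity logits; matched pairs sit on the diagonal.
    logits = a @ b.T / temperature
    # Cross-entropy in both directions (a -> b over rows, b -> a over columns).
    log_p_rows = logits - _logsumexp(logits, axis=1)
    log_p_cols = logits - _logsumexp(logits, axis=0)
    loss_a2b = -np.mean(np.diag(log_p_rows))
    loss_b2a = -np.mean(np.diag(log_p_cols))
    return (loss_a2b + loss_b2a) / 2
```

Minimizing this loss aligns matched pairs across modalities, yet, as the abstract notes, it does not by itself eliminate the modality gap: each modality's embeddings can still occupy a separate region of the shared space.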
Submission Number: 33