Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Yuhui Zhang; Elaine Sui; Serena Yeung

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

Yuhui Zhang, Elaine Sui, Serena Yeung

Published: 16 Jan 2024, Last Modified: 05 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: multi-modal contrastive learning, captioning, text-to-image generation

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: Our work explains the geometry of multi-modal contrastive representation space and introduces a three-step method to bridge the modality gap, achieving state-of-the-art results in zero-shot captioning and text-to-image generation.

Abstract: Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, $C^3$ (Connect, Collapse, Corrupt), to bridge the modality gap, enhancing the interchangeability of embeddings. Our $C^3$ method significantly improves cross-modal learning from uni-modal data, achieving state-of-the-art results on zero-shot image / audio / video captioning and text-to-image generation.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: representation learning for computer vision, audio, language, and other modalities

Submission Number: 813

Loading