TangentBind: Unlocking the Potential of Emergent Alignment in Multimodal Model

Jincheng Xie; Xingchen Xiao; Runheng Liu; Zhongyi Huang; Heyan Huang

27 Sept 2024 (modified: 05 Feb 2025). Submitted to ICLR 2025. License: CC BY 4.0

Keywords: TangentBind, Multi-modal Alignment, Optimization

TL;DR: We propose TangentBind to enhance emergent alignment between indirectly aligned modalities while retaining their alignment with the core modality.

Abstract: Improving the alignment of modalities has proven effective across various downstream tasks in multimodal models. Currently, modality alignment follows two main research directions: aligning all modalities simultaneously, or binding the other modalities by aligning each of them to a core modality. The first ensures direct alignment but is difficult to extend to new modalities. The second is scalable but weak in emergent ability because it lacks direct inter-modality alignment. To address these problems, we propose TangentBind. Specifically, we first align all modalities to a core modality, e.g., image or text. Then, we introduce a generative network that generates the embeddings of the second modality, e.g., text or image, based on the core modality embedding. Finally, other modalities, such as audio, are aligned to both the core modality and the generated embeddings, which improves emergent ability while retaining alignment with the core modality. During training, in addition to InfoNCE, the Tangent Term is introduced to align the new modalities with the generated embeddings. This addresses accuracy issues caused by using generated vectors as representations for modalities. With VISION and TEXT as the core modalities, our experiments include other modalities such as AUDIO, DEPTH, and INFRARED. Overall, our experiments show that the emergent ability of TangentBind significantly outperforms the original benchmark on 9 datasets.
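As a rough illustration of the InfoNCE objective named in the abstract, the following minimal sketch computes a one-directional InfoNCE loss over a batch of paired embeddings (for example, audio embeddings as queries and core-modality image embeddings as keys). The function names, the temperature value of 0.07, and the toy vectors are assumptions made for illustration and are not taken from the paper. The Tangent Term is not defined in the abstract, so it is deliberately left out.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss for a batch in which queries[i] is paired with keys[i].

    Every other key in the batch acts as a negative for query i.
    The temperature default is an assumed value, not one from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Subtract the max logit for a numerically stable log-sum-exp.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of the matching key under the softmax.
        total += log_denominator - logits[i]
    return total / len(q)


# Toy embeddings (hypothetical): two audio clips and their paired images.
audio = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0]]

loss_aligned = info_nce(audio, image)           # correct pairing
loss_mismatched = info_nce(audio, image[::-1])  # pairing deliberately swapped
```

In this toy setting the correctly paired batch yields a much lower loss than the swapped one, which is the behaviour a contrastive alignment objective is meant to reward.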

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 10797
