Keywords: Multimodal Embedding, Unsupervised Contrastive Learning, Tensor Norm Alignment, Multimodal Retrieval
TL;DR: We propose TNCME, a multimodal embedding framework that improves Top-1 retrieval performance by jointly aligning the direction and magnitude of embeddings during contrastive learning.
Abstract: Multimodal embedding representations have emerged as an active research topic and are widely applied to multimodal retrieval tasks. Unsupervised contrastive learning, represented by InfoNCE, is the mainstream training paradigm for these tasks. However, existing methods generally optimize only the directional alignment of positive pairs in the embedding space and neglect another fundamental property of the representation tensors: their magnitude. Motivated by this observation, we propose a \textbf{T}ensor's \textbf{N}orm \textbf{C}onstraints of \textbf{M}ultimodal \textbf{E}mbeddings framework, TNCME, which aligns the 2-norms of the embedding representations of positive pairs during contrastive learning and is trained jointly with the directional alignment pursued by InfoNCE. This approach improves the Top-1 performance of vision-language models on multimodal retrieval tasks.
We first rigorously prove that the norm-alignment training objective is consistent with the logic of contrastive learning, and then adapt this objective to multimodal retrieval tasks. Building on the VLM2Vec-V2 framework, we train and evaluate across a total of 81 tasks spanning three representative multimodal retrieval categories: Image-Text, VisDoc-Text, and Video-Text.
Experimental results demonstrate that the proposed TNCME outperforms baseline methods across all Top-1 metrics.
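To make the described objective concrete, here is a minimal sketch of a joint loss that combines symmetric InfoNCE (directional alignment) with an L2-norm alignment penalty on positive pairs. This is a hypothetical illustration assuming PyTorch; the function name `tncme_loss` and the weighting hyperparameter `lambda_norm` are assumptions, not taken from the submission.

```python
import torch
import torch.nn.functional as F

def tncme_loss(img_emb: torch.Tensor,
               txt_emb: torch.Tensor,
               temperature: float = 0.07,
               lambda_norm: float = 0.1) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings; row i of each is a positive pair."""
    # Directional alignment: standard symmetric InfoNCE on unit-normalized embeddings.
    img_dir = F.normalize(img_emb, dim=-1)
    txt_dir = F.normalize(txt_emb, dim=-1)
    logits = img_dir @ txt_dir.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    info_nce = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Magnitude alignment: penalize the gap between the 2-norms of each positive pair.
    img_norm = img_emb.norm(p=2, dim=-1)
    txt_norm = txt_emb.norm(p=2, dim=-1)
    norm_align = (img_norm - txt_norm).pow(2).mean()

    return info_nce + lambda_norm * norm_align
```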
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6566