Keywords: Multimodal Embedding, Unsupervised Contrastive Learning, Tensor Norm Alignment, Multimodal Retrieval
TL;DR: We propose TNCME, a multimodal embedding framework that improves Top-1 retrieval performance by jointly aligning the direction and magnitude of embeddings during contrastive learning.
Abstract: Multimodal embedding representations have emerged as an active research topic and are widely applied to multimodal retrieval tasks. Unsupervised contrastive learning, represented by InfoNCE, is the mainstream training paradigm for these tasks. However, existing methods generally optimize only the directional alignment of positive pairs in the embedding space and neglect another fundamental property of the representation tensors: their magnitude. Motivated by this observation, we propose a \textbf{T}ensor's \textbf{N}orm \textbf{C}onstraints of \textbf{M}ultimodal \textbf{E}mbeddings framework, TNCME, which aligns the 2-norms of the embedding representations of positive pairs during contrastive learning and is trained jointly with the directional alignment pursued by InfoNCE. This approach improves the Top-1 performance of vision-language models on multimodal retrieval tasks.
We first rigorously prove that the norm-alignment training objective is consistent with the logic of contrastive learning, and then adapt this objective to multimodal retrieval tasks. Building on the VLM2Vec-V2 framework, we train and evaluate across a total of 81 tasks spanning three representative multimodal retrieval categories: Image-Text, VisDoc-Text, and Video-Text.
Experimental results demonstrate that the proposed TNCME outperforms baseline methods across all Top-1 metrics.
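To make the described objective concrete, here is a minimal sketch of a joint loss that combines symmetric InfoNCE (directional alignment) with an L2-norm alignment penalty on positive pairs. This is a hypothetical illustration assuming PyTorch; the function name `tncme_loss` and the weighting hyperparameter `lambda_norm` are assumptions, not taken from the submission.

```python
import torch
import torch.nn.functional as F

def tncme_loss(img_emb: torch.Tensor,
               txt_emb: torch.Tensor,
               temperature: float = 0.07,
               lambda_norm: float = 0.1) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings; row i of each is a positive pair."""
    # Directional alignment: standard symmetric InfoNCE on unit-normalized embeddings.
    img_dir = F.normalize(img_emb, dim=-1)
    txt_dir = F.normalize(txt_emb, dim=-1)
    logits = img_dir @ txt_dir.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    info_nce = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Magnitude alignment: penalize the gap between the 2-norms of each positive pair.
    img_norm = img_emb.norm(p=2, dim=-1)
    txt_norm = txt_emb.norm(p=2, dim=-1)
    norm_align = (img_norm - txt_norm).pow(2).mean()

    return info_nce + lambda_norm * norm_align
```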
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6566