## ICLR 2026 conference submission

### TNCME: Tensor's Norm Constraints for Unsupervised Contrastive Learning of Multimodal Embeddings


#### Abstract


Multimodal embedding representation is an active research topic with direct applications to multimodal retrieval. Unsupervised contrastive learning, typified by InfoNCE, is the mainstream training paradigm for these tasks. However, existing methods generally optimize only the directional alignment of positive pairs in the embedding space and neglect that an embedding vector has two fundamental properties: “direction” and “magnitude.” Building on this observation, we propose TNCME, a novel multimodal embedding representation framework. TNCME aligns the 2-norms of the embeddings in each positive pair during contrastive learning and trains this objective jointly with the directional alignment pursued by InfoNCE. This improves the Top-1 performance of vision-language models on multimodal retrieval tasks.
We first rigorously prove that the norm-alignment training objective is consistent with the training logic of contrastive learning, and then adapt this objective to multimodal retrieval tasks. Building on the VLM2Vec-V2 framework, we train and evaluate on a total of 81 tasks spanning three representative multimodal retrieval categories: Image-Text, VisDoc-Text, and Video-Text. The results show that TNCME outperforms the baseline methods across all Top-1 metrics.



#### Introduction to the Repository

This repository implements the TNCME framework on top of [VLM2Vec-V2](https://github.com/TIGER-AI-Lab/VLM2Vec), with training and dataset configurations consistent with VLM2Vec-V2. This document outlines the parts of the source code that TNCME changes relative to VLM2Vec-V2.

InfoTN is implemented in **train/src/loss.py**, the Norm Alignment Projector is implemented in **train/src/model/vlm_backbone/qwen2_vl/modeling_qwen2_vl_copy.py**, and the projector's LoRA fine-tuning is declared in **train/src/arguments.py**.
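
As a rough, framework-free illustration of the idea described in the abstract (a directional InfoNCE term trained jointly with a term that aligns the 2-norms of each positive pair), the sketch below combines the two in plain Python. It is not the code in **train/src/loss.py**: the squared-gap form of the norm term, the `weight` and `temperature` defaults, and all function names are assumptions made for illustration only.

```python
import math


def l2_norm(v):
    """Euclidean (2-)norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def info_nce(queries, targets, temperature=0.02):
    """Mean InfoNCE loss; targets[i] is the positive for queries[i].

    Vectors are unit-normalized first, so this term only sees direction.
    Zero vectors are not handled in this sketch.
    """
    q_unit = [[x / l2_norm(q) for x in q] for q in queries]
    t_unit = [[x / l2_norm(t) for x in t] for t in targets]
    total = 0.0
    for i, q in enumerate(q_unit):
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in t_unit]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in logits))
        total += log_z - logits[i]
    return total / len(queries)


def norm_alignment(queries, targets):
    """Mean squared gap between the 2-norms of each positive pair (assumed form)."""
    gaps = [(l2_norm(q) - l2_norm(t)) ** 2 for q, t in zip(queries, targets)]
    return sum(gaps) / len(gaps)


def combined_loss(queries, targets, weight=0.1, temperature=0.02):
    """Directional term plus a weighted magnitude term (hypothetical weighting)."""
    return info_nce(queries, targets, temperature) + weight * norm_alignment(queries, targets)
```

Because `info_nce` normalizes its inputs, rescaling a target leaves that term unchanged, while `norm_alignment` grows with the gap between the two norms. This is the "direction" versus "magnitude" split that the abstract refers to.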

#### Environment Installation

First, install the dependencies listed in **requirements.txt** into a Python 3.12 environment.

#### Experiment 

Download and extract the [training](https://huggingface.co/datasets/TIGER-Lab/MMEB-train) and [testing](https://huggingface.co/datasets/TIGER-Lab/MMEB-V2) datasets following the VLM2Vec-V2 instructions, and download the [backbone model](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) (Qwen2-VL-2B-Instruct). Following the experimental setup of this work, the training data is divided into three configurations: Image-Text Only, Image-Text and VisDoc-Text, and All (the full training set). Specify the corresponding **YAML** file in the shell script.
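
The three Hugging Face repositories linked above can also be fetched from Python. The sketch below is one possible way to do this with `snapshot_download` from the `huggingface_hub` package; the repository IDs are taken from the links, while the local folder names and the `download_assets` helper are hypothetical. Any archive extraction still has to follow the VLM2Vec-V2 instructions.

```python
from pathlib import Path

# (repo_id, repo_type, local subdirectory). The IDs come from the links above;
# the subdirectory names are placeholders chosen for this sketch.
ASSETS = [
    ("TIGER-Lab/MMEB-train", "dataset", "MMEB-train"),
    ("TIGER-Lab/MMEB-V2", "dataset", "MMEB-V2"),
    ("Qwen/Qwen2-VL-2B-Instruct", "model", "Qwen2-VL-2B-Instruct"),
]


def download_assets(root):
    """Download every asset into its own folder under `root` and return the folders."""
    # Imported lazily so this module still loads when huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    paths = []
    for repo_id, repo_type, subdir in ASSETS:
        target = Path(root) / subdir
        snapshot_download(repo_id=repo_id, repo_type=repo_type, local_dir=str(target))
        paths.append(target)
    return paths
```

Calling `download_assets("data")` would place the two datasets and the backbone under `data/`; the paths in the YAML files and shell scripts then need to point at wherever the files actually end up.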

#### Training 

```bash
# Train TNCME
bash train/train_tncme.sh
```

#### Evaluation


```bash
# Evaluate your checkpoint
bash eval/eval_8gpu.sh
```

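#### Experiment Results

The first table reports each metric averaged over the 36 Image, 27 VisDoc, and 18 Video tasks. Its columns form three groups of six (three VLM2Vec-V2 columns followed by three TNCME columns), one group per training set. Matched against the second table, the groups appear to be ordered Image-Text Only, Image-Text and VisDoc-Text, and All.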


|**Method**       |  **VLM2Vec-V2**  |  **VLM2Vec-V2**   |  **VLM2Vec-V2**  |    **TNCME**     |     **TNCME**     |    **TNCME**     |  **VLM2Vec-V2**  |  **VLM2Vec-V2**   |  **VLM2Vec-V2**  |    **TNCME**     |     **TNCME**     |    **TNCME**     |  **VLM2Vec-V2**  |  **VLM2Vec-V2**   |  **VLM2Vec-V2**  |    **TNCME**     |     **TNCME**     |    **TNCME**     |
|------------------|:-----------------:|:------------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|:------------------:|:-----------------:|:-----------------:|:------------------:|:-----------------:|:-----------------:|:------------------:|:-----------------:|:-----------------:|:------------------:|:-----------------:|
|**Metric**       |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |**Image 36 Avg.** |**Visdoc 27 Avg.** |**Video 18 Avg.** |
|**Hit@1**        |       63.7        |        22.9        |       30.0        |    **64.9**     |     **24.5**      |    **31.4**     |       64.5        |        47.9        |       32.9        |    **65.3**     |     **49.1**     |    **33.0**     |       62.4        |        49.4        |       35.0        |    **62.8**     |     **50.3**     |    **35.2**     |
|**Hit@5**        |       84.1        |        42.5        |       70.3        |    **85.0**     |     **43.0**      |    **72.1**     |       84.5        |        71.1        |    **73.2**     |    **84.9**     |     **72.2**     |       72.9        |    **83.9**     |     **73.0**     |    **74.1**     |    **83.9**     |        72.2        |    **74.1**     |
|**Hit@10**       |       88.7        |        **52.1**        |       82.0        |    **89.5**     |     **52.1**      |    **83.1**     |       89.3        |        78.6        |    **83.6**     |    **89.5**     |     **80.1**     |       83.2        |    **88.7**     |     **79.6**     |    **84.3**     |    **88.7**     |        79.1        |    **84.3**     |
|**NDCG_L@1**     |       63.7        |        21.6        |       30.0        |    **64.9**     |     **23.3**      |    **31.4**     |       64.5        |        46.2        |       32.9        |    **65.3**     |     **47.4**     |    **33.0**     |       62.4        |        47.8        |       35.0        |    **62.8**     |     **48.6**     |    **35.2**     |
|**NDCG_L@5**     |       75.0        |        28.5        |       51.3        |    **76.1**     |     **29.7**      |    **52.8**     |       75.6        |        54.7        |       **54.2**        |    **76.2**     |     **56.0**     |    **54.2**     |       74.3        |        56.2        |       55.7        |    **74.5**     |     **56.3**     |    **55.9**     |
|**NDCG_L@10**    |       76.5        |        30.9        |       55.1        |    **77.6**     |     **32.0**      |    **56.4**     |       77.1        |        56.7        |       57.5        |    **77.7**     |     **58.3**     |    **57.6**     |       75.9        |        57.9        |       59.0        |    **76.1**     |     **58.0**     |    **59.2**     |
|**NDCG_E@1**     |       63.7        |        20.7        |       30.0        |    **64.9**     |     **22.5**      |    **31.4**     |       64.5        |        45.1        |       32.9        |    **65.3**     |     **46.2**     |    **33.0**     |       62.4        |        46.7        |       35.0        |    **62.8**     |     **47.4**     |    **35.2**     |
|**NDCG_E@5**     |       75.0        |        28.0        |       51.3        |    **76.1**     |     **29.3**      |    **52.8**     |       75.6        |        54.0        |       **54.2**        |    **76.2**     |     **55.4**     |    **54.2**     |       74.3        |        55.6        |       55.7        |    **74.5**     |     **55.7**     |    **55.9**     |
|**NDCG_E@10**    |       76.5        |        30.6        |       55.1        |    **77.6**     |     **31.7**      |    **56.4**     |       77.1        |        56.3        |       57.5        |    **77.7**     |     **57.9**     |    **57.6**     |       75.9        |        57.5        |       59.0        |    **76.1**     |     **57.6**     |    **59.2**     |
|**Precision@1**  |       63.7        |        22.9        |       30.0        |    **64.9**     |     **24.5**      |    **31.4**     |       64.5        |        47.9        |       32.9        |    **65.3**     |     **49.1**     |    **33.0**     |       62.4        |        49.4        |       35.0        |    **62.8**     |     **50.3**     |    **35.2**     |
|**Precision@5**  |       16.8        |     **11.9**     |       14.1        |    **17.0**     |       11.4        |    **14.5**     |       16.9        |        19.9        |    **14.7**     |    **17.0**     |     **20.5**     |       14.7        |    **16.8**     |     **20.5**     |    **14.9**     |    **16.8**     |     **20.5**     |    **14.9**     |
|**Precision@10** |        8.9        |     **8.8**      |        8.3        |     **9.0**     |        8.1        |     **8.4**     |        8.9        |        13.7        |     **8.4**     |     **9.0**     |     **14.1**     |        8.4        |     **8.9**     |     **13.6**     |     **8.5**     |     **8.9**     |     **13.6**     |     **8.5**     |
|**Recall@1**     |       63.7        |        16.2        |       29.9        |    **64.9**     |     **18.3**      |    **31.3**     |       64.5        |        36.9        |       32.8        |    **65.3**     |     **38.3**     |    **32.9**     |       62.4        |        38.2        |       34.8        |    **62.8**     |     **38.6**     |    **35.1**     |
|**Recall@5**     |       84.1        |        31.5        |       70.2        |    **85.0**     |     **32.9**      |    **72.0**     |       84.5        |        57.5        |    **73.2**     |    **84.9**     |     **58.5**     |       72.8        |    **83.9**     |     **59.1**     |       74.0        |    **83.9**     |        58.6        |    **74.1**     |
|**Recall@10**    |       88.7        |        40.2        |       82.0        |    **89.5**     |     **40.9**      |    **83.1**     |       89.3        |        65.7        |    **83.6**     |    **89.5**     |     **67.4**     |       83.2        |    **88.7**     |     **66.6**     |    **84.3**     |    **88.7**     |        66.2        |    **84.3**     |
|**F@1**          |       63.7        |        16.8        |       29.9        |    **64.9**     |     **18.9**      |    **31.3**     |       64.5        |        38.1        |       **32.9**        |    **65.3**     |     **39.5**     |    **32.9**     |       62.4        |        39.5        |       34.9        |    **62.8**     |     **40.0**     |    **35.1**     |
|**F@5**          |       28.1        |        13.0        |       23.5        |    **28.3**     |     **13.2**      |    **24.1**     |       28.2        |        23.2        |    **24.5**     |    **28.3**     |     **23.8**     |       24.4        |    **28.0**     |     **23.9**     |    **24.8**     |    **28.0**     |     **23.9**     |    **24.8**     |
|**F@10**         |       16.1        |        **10.7**        |       15.0        |    **16.3**     |       10.3        |    **15.2**     |       16.2        |        17.2        |    **15.3**     |    **16.3**     |     **17.7**     |       15.2        |    **16.1**     |     **17.3**     |    **15.4**     |    **16.1**     |        17.2        |    **15.4**     |
|**MAP@1**        |       63.7        |        22.9        |       30.0        |    **64.9**     |     **24.5**      |    **31.4**     |       64.5        |        47.9        |       32.9        |    **65.3**     |     **49.1**     |    **33.0**     |       62.4        |        49.4        |       35.0        |    **62.6**     |     **50.3**     |    **35.2**     |
|**MAP@5**        |       72.0        |        25.2        |       44.9        |    **73.1**     |     **26.7**      |    **46.4**     |       72.6        |        50.5        |       47.8        |    **73.2**     |     **52.1**     |    **48.0**     |       71.1        |        51.9        |       49.5        |    **71.3**     |     **52.2**     |    **49.9**     |
|**MAP@10**       |       72.6        |        25.6        |       46.5        |    **73.7**     |     **27.0**      |    **48.0**     |       73.2        |        50.6        |       49.2        |    **73.9**     |     **52.3**     |    **49.4**     |       71.7        |        51.7        |       50.9        |    **72.0**     |     **52.0**     |    **51.1**     |
|**MRR@1**        |       63.7        |        22.9        |       30.0        |    **64.9**     |     **24.5**      |    **31.4**     |       64.5        |        47.9        |       32.9        |    **65.3**     |     **49.1**     |    **33.0**     |       62.4        |        49.4        |       35.0        |    **62.8**     |     **50.2**     |    **35.2**     |
|**MRR@5**        |       72.0        |        29.9        |       45.0        |    **73.1**     |     **31.1**      |    **46.5**     |       72.6        |        56.7        |       47.9        |    **73.2**     |     **57.9**     |    **48.0**     |       71.1        |        58.5        |       49.6        |    **71.3**     |     **58.8**     |    **49.9**     |
|**MRR@10**       |       72.6        |        31.2        |       46.7        |    **73.7**     |     **32.3**      |    **48.0**     |       73.2        |        57.7        |       49.3        |    **73.9**     |     **59.0**     |    **49.4**     |       71.7        |        59.4        |       50.9        |    **72.0**     |     **59.7**     |    **51.2**     |
|**Avg**          |       63.2        |        25.3        |       41.7        |    **64.1**     |     **26.4**      |    **42.9**     |       63.8        |        47.6        |    **44.1**     |    **64.3**     |     **48.8**     |       44.0        |       62.5        |        48.8        |       45.4        |    **62.7**     |     **49.0**     |    **45.6**     |

| **Method**       |  **VLM2Vec-V2** |     **TNCME**    |       **VLM2Vec-V2**       |          **TNCME**         | **VLM2Vec-V2** |     **TNCME**    |
|------------------|:---------------:|:----------------:|:--------------------------:|:--------------------------:|:--------------:|:----------------:|
| **Training Set** | Image-Text Only |  Image-Text Only | Image-Text and VisDoc-Text | Image-Text and VisDoc-Text |       All      |        All       |
| **Avg@1**        |      41.9       | **43.3 (+1.4%)** |            50.9            |      **51.7 (+0.8%)**      |      50.9      | **51.4 (+0.5%)** |

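For readers unfamiliar with the cutoff metrics in the tables, the sketch below gives the standard textbook definitions of Hit@k, Precision@k, Recall@k, and MRR@k for a single query. It is an illustration only: the function names and the toy ranking are hypothetical, and the evaluation code in this repository may differ in detail.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant item appears in the top k results, else 0.0."""
    return float(any(item in relevant for item in ranked[:k]))


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(item in relevant for item in ranked[:k]) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k results."""
    return sum(item in relevant for item in ranked[:k]) / len(relevant)


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant item within the top k, else 0.0."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


# Toy query: the single relevant candidate "d1" is ranked second.
ranked = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1"}

scores = {
    "Hit@1": hit_at_k(ranked, relevant, 1),              # 0.0
    "Hit@5": hit_at_k(ranked, relevant, 5),              # 1.0
    "Precision@5": precision_at_k(ranked, relevant, 5),  # 0.2
    "Recall@5": recall_at_k(ranked, relevant, 5),        # 1.0
    "MRR@5": mrr_at_k(ranked, relevant, 5),              # 0.5
}
```

With a single relevant candidate per query, Precision@k reduces to Hit@k / k, which is consistent with the Image columns above (for example, 84.1 / 5 ≈ 16.8), and the @1 variants of the metrics sketched here all coincide with top-1 accuracy.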

