CoVLR: Coordinating Cross-Modal Consistency and Intra-Modal Relations for Vision-Language Retrieval

Published: 01 Jan 2024, Last Modified: 15 May 2025 · ICME 2024 · CC BY-SA 4.0
Abstract: Vision-language retrieval aims to perform cross-modal instance search by learning consistent vision-language representations. In real applications, however, the divergence between vision and language yields strong and weak modalities, and the two modalities perform differently on uni-modal tasks. In this paper, we show that conventional hard vision-language consistency disrupts the relationships among uni-modal instances in this weak-strong modal scenario, degrading uni-modal retrieval capability. To address this issue, we propose Coordinated Vision-Language Retrieval (CoVLR), which cooperatively optimizes cross-modal consistency and intra-modal structure-preserving objectives via a meta-learning strategy. Specifically, CoVLR treats the cross-modal consistency loss as the meta-train task and uses intra-modal structure preservation as the meta-test task that validates it. Extensive experiments on commonly used datasets demonstrate that CoVLR outperforms other baselines across various retrieval scenarios.
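The meta-train/meta-test coordination described above can be sketched as a first-order bi-level update: take an inner step on the cross-modal consistency loss, then evaluate an intra-modal structure-preserving loss at the updated parameters to steer the outer update. The sketch below is a minimal, dependency-free toy illustration of that scheme, not the paper's implementation; the loss functions, toy embeddings, step sizes, and parameter vector `theta` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.normal(scale=0.5, size=(4, 3))         # toy image embeddings (assumed data)
txt = img + rng.normal(scale=0.2, size=(4, 3))   # paired text embeddings (assumed data)

def consistency_loss(theta, a, b):
    """Cross-modal consistency (meta-train): pull paired embeddings together."""
    return np.mean((a * theta - b * theta) ** 2)

def structure_loss(theta, x):
    """Intra-modal structure preserving (meta-test): keep neighbor distances."""
    emb = x * theta
    d_emb = np.sum((emb[1:] - emb[:-1]) ** 2, axis=1)
    d_x = np.sum((x[1:] - x[:-1]) ** 2, axis=1)
    return np.mean((d_emb - d_x) ** 2)

def num_grad(f, theta, eps=1e-5):
    """Central-difference gradient, to keep the sketch dependency-free."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

theta = np.full(3, 0.5)      # toy stand-in for encoder parameters (assumed)
alpha, lr = 0.01, 0.005      # inner / outer step sizes (assumed)
init = consistency_loss(theta, img, txt) + structure_loss(theta, img)

for _ in range(300):
    # meta-train: inner step on the cross-modal consistency loss
    g_cons = num_grad(lambda t: consistency_loss(t, img, txt), theta)
    theta_inner = theta - alpha * g_cons
    # meta-test: validate structure preservation at the updated parameters
    g_struct = num_grad(lambda t: structure_loss(t, img), theta_inner)
    # first-order meta update combining both signals
    theta = theta - lr * (g_cons + g_struct)

final = consistency_loss(theta, img, txt) + structure_loss(theta, img)
```

In this toy setting the two objectives conflict (consistency shrinks `theta`, structure preservation pulls it toward identity scaling), mirroring the weak-strong modal tension the paper targets; the coordinated update drives the combined loss down rather than sacrificing one objective for the other.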