Exploiting Vision-Language Model for Visible-Infrared Person Re-identification via Textual Modality Alignment
Abstract: Visible-Infrared Person Re-identification (VI-ReID) aims to match images of a specific person captured by cameras of different modalities. Previous methods introduce a synthesized auxiliary modality to alleviate the modality discrepancy. However, they directly fuse the raw pixels of visible and infrared images, ignoring high-level semantic patterns. Additionally, the large modality gap cannot be fully bridged, which leads to oscillations in the feature space. Thus, in this paper, we propose a novel Textual Modality Alignment Learning method, named TMAL, which tackles these two issues in a unified two-stage framework. Specifically, we first exploit the semantic alignment in the CLIP model through learnable text tokens, which are then encoded to form semantic representations of each identity. In the second stage, we propose the Modality Alignment Module, empowering the image encoder with modality-shared and modality-specific features. We also introduce the Identity Enhancement Module (IEM) to extract more informative modality-specific features. Experiments on two benchmarks demonstrate the efficacy of our method.
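The first stage described above follows the general recipe of prompt learning with a frozen CLIP text encoder. The sketch below illustrates that idea only (it is not the authors' implementation): each identity owns a small set of learnable context tokens that are spliced into a fixed template such as "A photo of a X X X X person." and passed through CLIP's text transformer to obtain an identity-level semantic representation. Names such as `PromptLearner`, `n_ctx`, and the prefix length are illustrative assumptions, and the code assumes an fp32 CLIP model (e.g., loaded on CPU).

```python
# Minimal sketch of identity-specific learnable text tokens with a frozen
# CLIP text encoder, in the spirit of CoOp/CLIP-ReID-style prompt learning.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


class PromptLearner(nn.Module):
    """Holds n_ctx learnable tokens per identity and builds prompt embeddings."""

    def __init__(self, clip_model, num_ids, n_ctx=4):
        super().__init__()
        ctx_dim = clip_model.ln_final.weight.shape[0]
        # One set of learnable context vectors per identity.
        self.ctx = nn.Parameter(torch.randn(num_ids, n_ctx, ctx_dim) * 0.02)

        # Template with n_ctx placeholder tokens ("X") that will be replaced.
        prompt = "A photo of a " + " ".join(["X"] * n_ctx) + " person."
        tokenized = clip.tokenize(prompt)                       # (1, 77)
        with torch.no_grad():
            embedding = clip_model.token_embedding(tokenized)   # (1, 77, ctx_dim)

        # Frozen prefix (SOS + "A photo of a") and suffix ("person." + EOT).
        n_prefix = 1 + 4  # assumed token count of the template prefix in this sketch
        self.register_buffer("prefix", embedding[:, :n_prefix, :])
        self.register_buffer("suffix", embedding[:, n_prefix + n_ctx:, :])
        self.register_buffer("tokenized", tokenized)

    def forward(self, id_labels):
        # id_labels: (B,) long tensor of identity indices.
        ctx = self.ctx[id_labels]                               # (B, n_ctx, ctx_dim)
        prefix = self.prefix.expand(len(id_labels), -1, -1)
        suffix = self.suffix.expand(len(id_labels), -1, -1)
        prompts = torch.cat([prefix, ctx, suffix], dim=1)       # (B, 77, ctx_dim)
        return prompts, self.tokenized.expand(len(id_labels), -1)


def encode_text(clip_model, prompt_embeddings, tokenized):
    """Run CLIP's frozen text transformer on pre-built prompt embeddings."""
    x = prompt_embeddings + clip_model.positional_embedding
    x = x.permute(1, 0, 2)                                      # NLD -> LND
    x = clip_model.transformer(x)
    x = x.permute(1, 0, 2)                                      # LND -> NLD
    x = clip_model.ln_final(x)
    # Take the feature at the EOT token (highest token id), as CLIP does.
    x = x[torch.arange(x.shape[0]), tokenized.argmax(dim=-1)] @ clip_model.text_projection
    return x
```

In such a setup, only the per-identity context vectors are optimized (e.g., by pulling them toward the frozen image features of the corresponding identity), and the resulting text features can then serve as modality-invariant anchors for the second stage.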