Exploiting Vision-Language Model for Visible-Infrared Person Re-identification via Textual Modality Alignment
Abstract: Visible-Infrared Person Re-identification (VI-ReID) aims to match images of a specific person captured by cameras of different modalities. Previous methods introduce a synthesized auxiliary modality to alleviate the modality discrepancy. However, they directly fuse the raw pixels of visible and infrared images, ignoring high-level semantic patterns. Additionally, the large modality gap cannot be fully bridged, which leads to oscillations in the feature space. Thus, in this paper, we propose a novel Textual Modality Alignment Learning method, named TMAL, which tackles these two issues in a unified two-stage framework. Specifically, we first exploit the semantic alignment in the CLIP model through learnable text tokens, which are then encoded to form semantic representations of each identity. In the second stage, we propose the Modality Alignment Module, empowering the image encoder with modality-shared and modality-specific features. We also introduce the Identity Enhancement Module (IEM) to extract more informative modality-specific features. Experiments on two benchmarks demonstrate the efficacy of our method.
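The first stage described above follows the general recipe of prompt learning with a frozen CLIP text encoder. The sketch below illustrates that idea only (it is not the authors' implementation): each identity owns a small set of learnable context tokens that are spliced into a fixed template such as "A photo of a X X X X person." and passed through CLIP's text transformer to obtain an identity-level semantic representation. Names such as `PromptLearner`, `n_ctx`, and the prefix length are illustrative assumptions, and the code assumes an fp32 CLIP model (e.g., loaded on CPU).

```python
# Minimal sketch of identity-specific learnable text tokens with a frozen
# CLIP text encoder, in the spirit of CoOp/CLIP-ReID-style prompt learning.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP: https://github.com/openai/CLIP


class PromptLearner(nn.Module):
    """Holds n_ctx learnable tokens per identity and builds prompt embeddings."""

    def __init__(self, clip_model, num_ids, n_ctx=4):
        super().__init__()
        ctx_dim = clip_model.ln_final.weight.shape[0]
        # One set of learnable context vectors per identity.
        self.ctx = nn.Parameter(torch.randn(num_ids, n_ctx, ctx_dim) * 0.02)

        # Template with n_ctx placeholder tokens ("X") that will be replaced.
        prompt = "A photo of a " + " ".join(["X"] * n_ctx) + " person."
        tokenized = clip.tokenize(prompt)                       # (1, 77)
        with torch.no_grad():
            embedding = clip_model.token_embedding(tokenized)   # (1, 77, ctx_dim)

        # Frozen prefix (SOS + "A photo of a") and suffix ("person." + EOT).
        n_prefix = 1 + 4  # assumed token count of the template prefix in this sketch
        self.register_buffer("prefix", embedding[:, :n_prefix, :])
        self.register_buffer("suffix", embedding[:, n_prefix + n_ctx:, :])
        self.register_buffer("tokenized", tokenized)

    def forward(self, id_labels):
        # id_labels: (B,) long tensor of identity indices.
        ctx = self.ctx[id_labels]                               # (B, n_ctx, ctx_dim)
        prefix = self.prefix.expand(len(id_labels), -1, -1)
        suffix = self.suffix.expand(len(id_labels), -1, -1)
        prompts = torch.cat([prefix, ctx, suffix], dim=1)       # (B, 77, ctx_dim)
        return prompts, self.tokenized.expand(len(id_labels), -1)


def encode_text(clip_model, prompt_embeddings, tokenized):
    """Run CLIP's frozen text transformer on pre-built prompt embeddings."""
    x = prompt_embeddings + clip_model.positional_embedding
    x = x.permute(1, 0, 2)                                      # NLD -> LND
    x = clip_model.transformer(x)
    x = x.permute(1, 0, 2)                                      # LND -> NLD
    x = clip_model.ln_final(x)
    # Take the feature at the EOT token (highest token id), as CLIP does.
    x = x[torch.arange(x.shape[0]), tokenized.argmax(dim=-1)] @ clip_model.text_projection
    return x
```

In such a setup, only the per-identity context vectors are optimized (e.g., by pulling them toward the frozen image features of the corresponding identity), and the resulting text features can then serve as modality-invariant anchors for the second stage.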