Abstract: Visible-infrared person re-identification (VI-ReID) is a cross-modality fine-grained classification task. Existing approaches for VI-ReID mainly explore modality-invariant features for person retrieval. However, because infrared images lack texture information, modality-invariant features attend mostly to global contexts, so persons with similar silhouettes are often misidentified. To address this problem, this paper introduces natural language specification to learn global-local contexts for VI-ReID. Specifically, our framework jointly optimizes visible-infrared alignment (VIA) and visual-textual reasoning (VTR). VIA achieves cross-modal alignment between RGB and IR images, explicitly leveraging the designed modality-guided alignment and relationship-reinforced fusion to exploit local cues within global features. VTR introduces pooling selection and dual-level reasoning mechanisms that force the image encoder to attend to significant regions indicated by textual descriptions. Extensive experiments on the popular SYSU-MM01 and RegDB datasets show that the proposed method significantly outperforms state-of-the-art approaches.
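As a rough illustration of the joint VIA + VTR objective described in the abstract, the sketch below combines a visible-infrared alignment loss with a visual-textual alignment loss. It is a minimal sketch under assumptions: the module names, encoders, identity count, and loss weights are hypothetical placeholders and do not reflect the paper's actual architecture or losses.

```python
# Hypothetical sketch of a joint VIA + VTR training objective.
# Encoders, module names, and loss weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVIReIDModel(nn.Module):
    """Toy two-branch model: a shared image encoder for RGB/IR plus a text encoder."""

    def __init__(self, feat_dim=512, num_ids=395, vocab_size=10000):
        super().__init__()
        # Stand-in encoders; the real framework would use CNN/Transformer backbones.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.text_encoder = nn.EmbeddingBag(vocab_size, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, rgb, ir, tokens, offsets):
        f_rgb = F.normalize(self.image_encoder(rgb), dim=1)
        f_ir = F.normalize(self.image_encoder(ir), dim=1)
        f_txt = F.normalize(self.text_encoder(tokens, offsets), dim=1)
        return f_rgb, f_ir, f_txt

    def losses(self, rgb, ir, tokens, offsets, labels):
        f_rgb, f_ir, f_txt = self.forward(rgb, ir, tokens, offsets)
        # VIA (placeholder): pull RGB and IR features of the same identity
        # together and keep both discriminative via identity classification.
        via_loss = (
            (1 - F.cosine_similarity(f_rgb, f_ir)).mean()
            + F.cross_entropy(self.classifier(f_rgb), labels)
            + F.cross_entropy(self.classifier(f_ir), labels)
        )
        # VTR (placeholder): align image features with their textual
        # descriptions so the image encoder focuses on described regions.
        vtr_loss = (
            (1 - F.cosine_similarity(f_rgb, f_txt)).mean()
            + (1 - F.cosine_similarity(f_ir, f_txt)).mean()
        )
        # Joint objective; the 0.5 weight is illustrative, not from the paper.
        return via_loss + 0.5 * vtr_loss
```

In practice, both losses would be minimized with a single optimizer over the shared parameters, which is what "jointly optimizes VIA and VTR" suggests.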