Abstract: Visible-infrared person re-identification (VI-ReID) is a cross-modality fine-grained classification task. Existing approaches for VI-ReID mainly explore modality-invariant features for person retrieval. However, because infrared images lack texture information, modality-invariant features attend mostly to global contexts, so persons with similar silhouettes are often misidentified. To address this problem, this paper introduces natural language specification to learn global-local contexts for VI-ReID. Specifically, our framework jointly optimizes visible-infrared alignment (VIA) and visual-textual reasoning (VTR). VIA achieves cross-modal alignment between RGB and IR images, explicitly leveraging the designed modality-guided alignment and relationship-reinforced fusion to exploit local cues within global features. VTR introduces pooling selection and dual-level reasoning mechanisms that force the image encoder to attend to significant regions indicated by textual descriptions. Extensive experiments on the popular SYSU-MM01 and RegDB datasets show that the proposed method significantly outperforms state-of-the-art approaches.
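As a rough illustration of the joint VIA + VTR objective described in the abstract, the sketch below combines a visible-infrared alignment loss with a visual-textual alignment loss. It is a minimal sketch under assumptions: the module names, encoders, identity count, and loss weights are hypothetical placeholders and do not reflect the paper's actual architecture or losses.

```python
# Hypothetical sketch of a joint VIA + VTR training objective.
# Encoders, module names, and loss weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointVIReIDModel(nn.Module):
    """Toy two-branch model: a shared image encoder for RGB/IR plus a text encoder."""

    def __init__(self, feat_dim=512, num_ids=395, vocab_size=10000):
        super().__init__()
        # Stand-in encoders; the real framework would use CNN/Transformer backbones.
        self.image_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim))
        self.text_encoder = nn.EmbeddingBag(vocab_size, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, rgb, ir, tokens, offsets):
        f_rgb = F.normalize(self.image_encoder(rgb), dim=1)
        f_ir = F.normalize(self.image_encoder(ir), dim=1)
        f_txt = F.normalize(self.text_encoder(tokens, offsets), dim=1)
        return f_rgb, f_ir, f_txt

    def losses(self, rgb, ir, tokens, offsets, labels):
        f_rgb, f_ir, f_txt = self.forward(rgb, ir, tokens, offsets)
        # VIA (placeholder): pull RGB and IR features of the same identity
        # together and keep both discriminative via identity classification.
        via_loss = (
            (1 - F.cosine_similarity(f_rgb, f_ir)).mean()
            + F.cross_entropy(self.classifier(f_rgb), labels)
            + F.cross_entropy(self.classifier(f_ir), labels)
        )
        # VTR (placeholder): align image features with their textual
        # descriptions so the image encoder focuses on described regions.
        vtr_loss = (
            (1 - F.cosine_similarity(f_rgb, f_txt)).mean()
            + (1 - F.cosine_similarity(f_ir, f_txt)).mean()
        )
        # Joint objective; the 0.5 weight is illustrative, not from the paper.
        return via_loss + 0.5 * vtr_loss
```

In practice, both losses would be minimized with a single optimizer over the shared parameters, which is what "jointly optimizes VIA and VTR" suggests.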