Advancing Visible-Infrared Person Re-Identification: Synergizing Visual-Textual Reasoning and Cross-Modal Feature Alignment
Abstract: Visible-infrared person re-identification (VI-ReID) is a critical cross-modality fine-grained classification task with significant implications for public safety and security applications. Existing VI-ReID methods primarily focus on extracting modality-invariant features for person retrieval. However, due to the inherent lack of texture information in infrared images, these modality-invariant features tend to emphasize global contexts. Consequently, individuals with similar silhouettes are often misidentified, posing potential risks to security systems and forensic investigations. To address this problem, this paper introduces natural language descriptions to learn global-local contexts for VI-ReID. Specifically, we design a framework that jointly optimizes visible-infrared alignment plus (VIAP) and visual-textual reasoning (VTR), introduces a local-global joint measure (LJM) to strengthen the matching metric, and proposes a human-LLM collaborative approach for adding textual descriptions to existing cross-modal person re-identification datasets. VIAP aligns the RGB and IR modalities, explicitly exploiting frequency-aware modality alignment and relationship-reinforced fusion to mine local cues within global features together with modality-invariant information. VTR employs pooling-selection and dual-level reasoning mechanisms that guide the image encoder to attend to the salient regions indicated by the textual descriptions. LJM incorporates local feature distances into the matching metric, exploiting fine-grained information to improve retrieval relevance. Extensive experimental results on the popular SYSU-MM01 and RegDB datasets show that the proposed method significantly outperforms state-of-the-art approaches. The dataset is publicly available at https://github.com/qyx596/vireid-caption.
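To make the local-global joint measure idea concrete, the following is a minimal sketch of combining a global descriptor distance with part-level (local) feature distances at retrieval time. The function name, the weighting factor lambda_local, the part count, and the feature shapes are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a local-global joint distance for re-ID matching.
# All names and shapes here are assumptions for illustration only.
import torch
import torch.nn.functional as F

def joint_distance(q_global, g_global, q_local, g_local, lambda_local=0.5):
    """Combine global and part-level (local) distances for retrieval.

    q_global: (Nq, D)    query global features
    g_global: (Ng, D)    gallery global features
    q_local:  (Nq, P, D) query part features (P parts, assumed)
    g_local:  (Ng, P, D) gallery part features
    """
    # Global cosine distance: 1 - cosine similarity.
    qg = F.normalize(q_global, dim=-1)
    gg = F.normalize(g_global, dim=-1)
    d_global = 1.0 - qg @ gg.t()                  # (Nq, Ng)

    # Local distance: mean cosine distance over corresponding parts.
    ql = F.normalize(q_local, dim=-1)             # (Nq, P, D)
    gl = F.normalize(g_local, dim=-1)             # (Ng, P, D)
    sim = torch.einsum('qpd,gpd->qgp', ql, gl)    # (Nq, Ng, P)
    d_local = (1.0 - sim).mean(dim=-1)            # (Nq, Ng)

    # Joint measure: global distance refined by fine-grained local cues.
    return d_global + lambda_local * d_local

# Usage: rank gallery samples by ascending joint distance.
# dist = joint_distance(q_g, g_g, q_l, g_l); ranks = dist.argsort(dim=1)
```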