Relation-aware Semantic Alignment Network for Text-to-Image Person Retrieval

Published: 2025, Last Modified: 09 Nov 2025, Venue: ICASSP 2025, License: CC BY-SA 4.0
Abstract: Text-to-Image Person Retrieval (TIPR) aims to retrieve pedestrian images using natural language descriptions as queries. However, existing methods concentrate only on aligning individual text-image pairs and ignore the self-representations shared by the visible images and textual descriptions of the same identity, thereby neglecting the impact of the intra-modal information distribution on TIPR. In this paper, a novel Relation-aware Semantic Alignment Network (RSAN) is proposed to learn reliable and comprehensive visual-textual semantic associations across modalities. Specifically, a Global Semantic Alignment Matching (GSAM) loss is introduced to enhance the coherence of inter-modality features while preserving intra-modal representations for cross-modal matching. Additionally, an Adapter-assisted Information Aggregation (AIA) module is designed to further complement the fusion of contextual information between image features and text embeddings. Extensive experiments on two public benchmark datasets demonstrate the superiority of the proposed RSAN.
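To make the retrieval task itself concrete, the following is a minimal sketch of the generic ranking step in TIPR: a text query embedding is compared against gallery image embeddings and the gallery is sorted by similarity. It does not implement RSAN, the GSAM loss, or the AIA module, whose details are not given in the abstract. The function names, the use of cosine similarity, and the toy three-dimensional embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_gallery(text_embedding, gallery):
    """Return gallery image ids sorted by descending similarity to the text query."""
    scored = [(cosine(text_embedding, emb), image_id)
              for image_id, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored]


# Hypothetical toy embeddings; in practice these would come from
# trained image and text encoders.
gallery = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.1, 0.9, 0.2],
    "img_c": [0.5, 0.5, 0.5],
}
query = [1.0, 0.2, 0.0]

ranking = rank_gallery(query, gallery)
print(ranking)
```

Under this setup, any training objective for TIPR ultimately aims to shape the embeddings so that images of the described identity rank first.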