Implicit Alignment-Based Cross-Modal Symbiotic Network for Text-to-Image Person Re-Identification

Published: 2025, Last Modified: 20 Jan 2026, IEEE Trans. Inf. Forensics Secur. 2025, CC BY-SA 4.0
Abstract: Text-to-image person re-identification aims to retrieve images of a specific person from large image databases using textual descriptions. The core challenge of this task is the significant feature gap between the abstract nature of text and the visual concreteness of images. Existing solutions rely primarily on explicit alignment of global or fine-grained local features, which lacks flexibility and struggles to capture and exploit subtle features and relational information in multimodal data. In particular, for different images of the same person, feature extraction should shift its emphasis according to differences in the accompanying text descriptions. To address these issues, this paper proposes a Cross-Modal Symbiotic Network (CMSN) based on implicit alignment. First, CMSN employs an Implicit Multi-scale Feature Integration (IMFI) module to implicitly extract and fuse multi-scale features from images and text, adaptively capturing the feature relationships between the two modalities. Second, a Combined Representation Learning (CRL) module produces a combined representation of the text and image features, and a Combined-Representation Identity Alignment (CRIA) loss aligns and constrains the identity centers of the three feature vectors (image, text, and combined). Finally, we design a Semi-Positive Triplet (SPT) loss, which defines semi-positive samples from other images and texts of the same identity, providing additional supervisory information to the model and further reducing modality heterogeneity. Extensive experiments on the CUHK-PEDES dataset show that CMSN achieves a Rank-1 accuracy of 76.46% and an mAP of 70.28%, significantly outperforming existing state-of-the-art methods.
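
The abstract names the CRIA and SPT objectives without giving formulas, so the following PyTorch sketch is only an illustrative reconstruction of how such losses might look, not the authors' implementation. The function names, margin values, the use of per-batch means as "identity centers", and the batch-level definition of semi-positives (same-identity texts from other image-text pairs) are all assumptions made for clarity.

```python
import torch
import torch.nn.functional as F

def cria_loss(img_feats, txt_feats, comb_feats, pids):
    """Illustrative CRIA-style loss (assumption, not the paper's code):
    for each identity in the batch, compute the identity center (mean)
    of the image, text, and combined features, then pull the three
    centers toward one another with pairwise squared distances."""
    loss = img_feats.new_zeros(())
    for pid in pids.unique():
        m = pids == pid
        c_img = F.normalize(img_feats[m].mean(dim=0), dim=0)
        c_txt = F.normalize(txt_feats[m].mean(dim=0), dim=0)
        c_comb = F.normalize(comb_feats[m].mean(dim=0), dim=0)
        loss = loss + ((c_img - c_txt).pow(2).sum()
                       + (c_img - c_comb).pow(2).sum()
                       + (c_txt - c_comb).pow(2).sum())
    return loss / len(pids.unique())

def semi_positive_triplet_loss(img_feats, txt_feats, pids,
                               margin_neg=0.3, margin_semi=0.1):
    """Illustrative SPT-style loss. For each matched image-text pair:
    negatives are texts of a different identity; semi-positives are
    same-identity texts belonging to other pairs in the batch. The
    hardest negative is pushed below the positive by margin_neg, while
    semi-positives may trail the positive by at most margin_semi,
    supplying the extra same-identity supervision the abstract
    describes. Both margins are placeholder assumptions."""
    img = F.normalize(img_feats, dim=1)                   # (B, D)
    txt = F.normalize(txt_feats, dim=1)                   # (B, D)
    sim = img @ txt.t()                                   # cosine sims (B, B)
    pos = sim.diag()                                      # matched-pair similarity

    same_id = pids.unsqueeze(0) == pids.unsqueeze(1)      # (B, B)
    eye = torch.eye(len(pids), dtype=torch.bool, device=pids.device)
    semi_mask = same_id & ~eye                            # same id, different pair
    neg_mask = ~same_id

    # Hardest negative text per anchor image.
    neg_sim = sim.masked_fill(~neg_mask, float('-inf')).max(dim=1).values
    loss_neg = F.relu(neg_sim - pos + margin_neg).mean()

    # Semi-positives: allowed to be slightly less similar than the
    # true positive, but only by margin_semi.
    semi_sim = sim.masked_fill(~semi_mask, float('inf')).min(dim=1).values
    has_semi = semi_mask.any(dim=1)
    loss_semi = F.relu(pos - semi_sim - margin_semi)[has_semi]
    loss_semi = loss_semi.mean() if has_semi.any() else sim.new_zeros(())

    return loss_neg + loss_semi
```

In training, img_feats, txt_feats, and comb_feats would be batches of encoder outputs with pids holding the identity labels, and the two terms would be summed with the network's other objectives; how CMSN actually weights and combines its losses is not specified in the abstract.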