Abstract: Text-Image Person Re-Identification (TIReID) is a computer vision task that identifies a person in images or videos based on a textual description. Current works mainly employ Vision-Language Pretrained (VLP) models for cross-modal alignment and for establishing fine-grained text-image associations for re-identification. However, these methods neglect variations in intra-class sample features, hindering accurate cross-modal alignment and the discrimination of fine-grained features of the same person. In this paper, we propose a Cross-modal Intra-Class Learning (CICL) framework to enhance the model's ability to learn fine-grained cross-modal features and intra-class sample variations. We propose an Image-Text Intra-class Relevance Learning (ITRL) method that accounts for sample relevance during text-image matching, boosting re-identification capability while preserving fine-grained text-image alignment. We also propose a Bidirectional Masked Matching (BiMM) method that applies masks to images and texts, prompting the model to attend to different content regions and establish finer-grained cross-modal associations. We evaluate our approach with the CLIP model on three TIReID benchmarks, surpassing state-of-the-art performance on multiple metrics.
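The abstract does not specify BiMM's exact masking scheme; as an illustrative sketch only (all names, the mask ratio, and the use of a shared helper for both modalities are assumptions, not details from the paper), random masking of text tokens and image-patch indices might look like this:

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the paper's actual token is not stated


def mask_sequence(items, mask_ratio=0.3, mask_value=MASK_TOKEN, rng=None):
    """Randomly replace a fraction of items (text tokens or image-patch ids)
    with a mask value, so the model must rely on the remaining content
    when matching across modalities."""
    rng = rng or random.Random()
    n_mask = max(1, int(len(items) * mask_ratio))
    masked_idx = rng.sample(range(len(items)), n_mask)
    masked = list(items)
    for i in masked_idx:
        masked[i] = mask_value
    return masked


rng = random.Random(0)

# Text side: mask caption tokens.
caption = ["a", "woman", "in", "a", "red", "coat", "carrying", "a", "black", "bag"]
masked_caption = mask_sequence(caption, mask_ratio=0.3, rng=rng)

# Image side: mask patch indices (e.g. a 4x4 grid of ViT patches).
patch_ids = list(range(16))
masked_patches = mask_sequence(patch_ids, mask_ratio=0.3, mask_value=-1, rng=rng)
```

In a bidirectional setup, masked captions would be matched against full images and masked images against full captions, encouraging the model to ground each visible region or token in the other modality.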
DOI: 10.1007/978-981-96-1528-5_12