Abstract: Text-to-image person retrieval (TPR) aims to retrieve images of a specific person given a textual description, and most methods implicitly assume that the training image-text pairs are correctly aligned. In practice, pairs may be weakly or falsely correlated due to low image quality and annotation errors. Moreover, the strong visual similarity between different person identities can cause texts to be matched to the wrong images. To tackle these two issues, we present a Visual-Language Noise Modeling (ViLNM) method that learns robust cross-modal associations even in the presence of noise. Specifically, we design a Noise Token Aware (NTA) module that discards the words in a textual description that do not match the image and uses the remaining matched words to establish more reliable associations. In addition, to strengthen the model's ability to discriminate between different person identities, we propose a Joint Inter- and Intra-Modal Contrastive Loss (JII) and a Local Aggregation (LA) module that enlarge the feature differences between identities. We conduct comprehensive experiments on three public benchmarks, and ViLNM achieves the best performance.
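The abstract does not specify the exact form of the JII loss. As a minimal illustrative sketch only, assuming a SupCon-style formulation where pairs sharing a person identity are treated as positives both across modalities (image-text) and within each modality (image-image, text-text), one plausible instantiation might look like the following; all names, the temperature value, and the self-masking detail are assumptions, not the authors' definition:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, pos, tau=0.07, mask_self=False):
    """Identity-supervised InfoNCE over rows of a against columns of b.
    a, b: (B, D) L2-normalized features; pos: (B, B) 0/1 positive mask."""
    logits = a @ b.t() / tau
    if mask_self:
        # In the within-modal case, exclude each sample's trivial
        # similarity with itself (as in supervised contrastive losses).
        eye = torch.eye(len(a), dtype=torch.bool, device=a.device)
        logits = logits.masked_fill(eye, float("-inf"))
        pos = pos.masked_fill(eye, 0.0)
    log_prob = F.log_softmax(logits, dim=1)
    # Average log-likelihood over all identity-level positives per row.
    return (-(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)).mean()

def jii_loss(img_feats, txt_feats, pids, tau=0.07):
    """Hypothetical joint inter- and intra-modal contrastive loss.
    img_feats, txt_feats: (B, D) paired embeddings; pids: (B,) identity labels."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    # Samples with the same person identity count as positives.
    pos = (pids.unsqueeze(0) == pids.unsqueeze(1)).float()
    inter = info_nce(img, txt, pos, tau) + info_nce(txt, img, pos, tau)
    intra = (info_nce(img, img, pos, tau, mask_self=True)
             + info_nce(txt, txt, pos, tau, mask_self=True))
    return inter + intra
```

Such a joint objective pulls matched image-text pairs (and same-identity samples) together while pushing different identities apart in both the shared and the per-modality embedding spaces, which is consistent with the abstract's stated goal of increasing feature differences between identities.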