Prototypical Prompting for Text-to-image Person Re-identification

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: In this paper, we study the problem of Text-to-Image Person Re-identification (TIReID), which aims to retrieve, from a pool of candidate images, the images of the identity described by a text sentence. Benefiting from vision-language pre-training models such as CLIP (Contrastive Language-Image Pre-training), TIReID techniques have recently achieved remarkable progress. However, most existing methods focus only on instance-level matching and ignore identity-level matching, which involves associating multiple images and texts belonging to the same person. To address this, we propose a novel prototypical prompting framework (Propot) designed to simultaneously model instance-level and identity-level matching for TIReID. Propot transforms the identity-level matching problem into a prototype learning problem, aiming to learn identity-enriched prototypes. Specifically, Propot works by ‘initialize, adapt, enrich, then aggregate’. We first use CLIP to generate high-quality initial prototypes. Then, we propose a domain-conditional prototypical prompting (DPP) module to adapt the prototypes to the TIReID task using task-related information. Further, we propose an instance-conditional prototypical prompting (IPP) module to update prototypes conditioned on intra-modal and inter-modal instances to ensure prototype diversity. Finally, we design an adaptive prototype aggregation module to aggregate these prototypes, producing the final identity-enriched prototypes. With the identity-enriched prototypes, we diffuse their rich identity information to instances through a prototype-to-instance contrastive loss to facilitate identity-level matching. Extensive experiments conducted on three benchmarks demonstrate the superiority of Propot over existing TIReID methods.
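The prototype-to-instance contrastive loss mentioned above can be pictured as an InfoNCE-style objective in which each image or text embedding is pulled toward the identity-enriched prototype of its own identity and pushed away from the prototypes of other identities. The PyTorch sketch below is only an illustrative assumption of that idea; the function name, tensor shapes, and temperature value are ours, not the paper's exact formulation.

```python
# Minimal sketch (assumed formulation, not the paper's exact loss):
# an InfoNCE-style prototype-to-instance contrastive loss where the positive
# for each instance is the prototype of its own identity.
import torch
import torch.nn.functional as F


def prototype_to_instance_loss(instance_emb, prototypes, identity_labels, temperature=0.07):
    """instance_emb:    (B, D) image or text instance embeddings
    prototypes:      (K, D) identity-enriched prototypes, one per identity
    identity_labels: (B,)   index of each instance's identity in [0, K)
    """
    instance_emb = F.normalize(instance_emb, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)

    # Cosine similarity between every instance and every prototype: (B, K)
    logits = instance_emb @ prototypes.t() / temperature

    # Cross-entropy against the identity index treats the matching prototype
    # as the positive and all other prototypes as negatives.
    return F.cross_entropy(logits, identity_labels)


if __name__ == "__main__":
    B, K, D = 8, 4, 512  # batch size, number of identities, embedding dim
    emb = torch.randn(B, D)
    protos = torch.randn(K, D)
    labels = torch.randint(0, K, (B,))
    print(prototype_to_instance_loss(emb, protos, labels).item())
```

Presumably the same loss would be applied to both image and text instances so that identity information is diffused into both modalities, alongside the usual instance-level matching objectives.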
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Experience] Multimedia Applications, [Content] Vision and Language
Relevance To Conference: This paper addresses the task of text-to-image person re-identification (TIReID) and proposes a prototypical prompting framework (Propot) to simultaneously model instance-level and identity-level matching between images and texts. Overall, we transform the identity-level matching problem into a prototype learning problem that aims to learn identity-enriched prototypes. We design an “initialize, adapt, enrich, then aggregate” pipeline that aggregates images or texts of the same identity/category to generate modality-specific prototypes, and then aligns them to model identity/class-level cross-modal matching. We believe this contribution is also applicable to related multimedia tasks such as image-text matching and video-text retrieval.
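As a concrete illustration of the “initialize” step of the pipeline described above, the sketch below assumes that a modality-specific prototype is obtained by mean-pooling the frozen CLIP embeddings of all images (or all texts) sharing an identity. The grouping and mean-pooling choices are assumptions made for illustration, not necessarily the paper's exact procedure.

```python
# Minimal sketch (assumed initialization): build one modality-specific
# prototype per identity by averaging CLIP embeddings that share that identity.
import torch
import torch.nn.functional as F


def init_identity_prototypes(embeddings, identity_labels, num_identities):
    """embeddings:      (N, D) CLIP image or text embeddings for one modality
    identity_labels: (N,)   identity index of each embedding in [0, num_identities)
    Returns:         (K, D) L2-normalized prototypes, K = num_identities
    """
    D = embeddings.size(1)
    prototypes = torch.zeros(num_identities, D)
    counts = torch.zeros(num_identities, 1)

    # Sum embeddings per identity, then divide by the per-identity counts.
    prototypes.index_add_(0, identity_labels, embeddings)
    counts.index_add_(0, identity_labels, torch.ones(len(identity_labels), 1))
    prototypes = prototypes / counts.clamp(min=1)

    return F.normalize(prototypes, dim=-1)
```

These initial prototypes would then be adapted and enriched (by the DPP and IPP modules described in the abstract) before the final aggregation step.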
Submission Number: 2842