ProFD: Prompt-Guided Feature Disentangling for Occluded Person Re-Identification

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Poster · CC BY 4.0
Abstract: To address occlusion issues in person Re-Identification (ReID), many methods extract part features by introducing external spatial information. However, due to missing part appearance information caused by occlusion and noisy spatial information from external models, these purely vision-based approaches fail to correctly learn the features of human body parts from limited training data and struggle to accurately locate body parts, ultimately leading to misaligned part features. To tackle these challenges, we propose a Prompt-guided Feature Disentangling method (ProFD), which leverages the rich pre-trained knowledge of the textual modality to help the model generate well-aligned part features. ProFD first designs part-specific prompts and utilizes noisy segmentation masks to preliminarily align visual and textual embeddings, giving the textual prompts spatial awareness. Furthermore, to alleviate the noise from external masks, ProFD adopts a hybrid-attention decoder that enforces spatial and semantic consistency during decoding to minimize the impact of noise. Additionally, to avoid catastrophic forgetting, we employ a self-distillation strategy that retains the pre-trained knowledge of CLIP and mitigates over-fitting. Evaluation results on the Market1501, DukeMTMC-ReID, Occluded-Duke, Occluded-ReID, and P-DukeMTMC datasets demonstrate that ProFD achieves state-of-the-art performance.
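To illustrate the core idea of prompt-guided part feature decoding described in the abstract, below is a minimal sketch (not the authors' implementation): part-specific prompt embeddings act as queries in a cross-attention decoder over image patch tokens, so each body part is represented by the tokens its prompt attends to. The part names, class name, and dimensions are hypothetical, and the queries are randomly initialized here, whereas ProFD would derive them from CLIP text embeddings of part-specific prompts.

```python
# Illustrative sketch only; names (PART_NAMES, PartPromptDecoder) are hypothetical.
import torch
import torch.nn as nn

PART_NAMES = ["head", "torso", "arms", "legs", "feet"]  # assumed part set
EMB_DIM = 512  # CLIP-like embedding width

class PartPromptDecoder(nn.Module):
    """Cross-attention decoder with one query per body-part prompt."""
    def __init__(self, num_parts: int, dim: int = EMB_DIM):
        super().__init__()
        # In ProFD these queries would come from CLIP text embeddings of
        # part-specific prompts; here they are simply learnable parameters.
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, N, dim) visual tokens from the image encoder
        batch = patch_tokens.size(0)
        queries = self.part_queries.unsqueeze(0).expand(batch, -1, -1)  # (B, P, dim)
        part_feats, _ = self.cross_attn(queries, patch_tokens, patch_tokens)
        return part_feats  # (B, P, dim): one feature per body part

# Usage: decode 5 part features from 196 patch tokens (e.g., a ViT-B/16 encoder).
decoder = PartPromptDecoder(num_parts=len(PART_NAMES))
tokens = torch.randn(2, 196, EMB_DIM)
print(decoder(tokens).shape)  # torch.Size([2, 5, 512])
```

The paper's hybrid-attention decoder, mask-based alignment, and self-distillation are not reproduced here; this only shows how textual prompts can serve as queries for disentangling part features.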
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Engagement] Multimedia Search and Recommendation, [Content] Multimodal Fusion
Relevance To Conference: In this work, we propose a novel CLIP-based framework, Prompt-guided Feature Disentangling (ProFD), for the occluded person ReID task. ProFD uses textual prompts to guide the model to produce well-aligned and robust part features. We believe that augmenting textual modality information and leveraging multimodal pre-training knowledge can enhance model performance on single-modal tasks, particularly in addressing the challenges of occluded person ReID.
Supplementary Material: zip
Submission Number: 1964