Abstract: Text-based person retrieval (TBPR) is a vision-language task that aims to find specific pedestrians in a large image gallery using a textual description. The task remains challenging due to the heterogeneity between modalities and the redundancy in visual representations. Existing methods do not explicitly reduce the influence of background regions in images, which inevitably weakens the learned representations and degrades image-text matching performance. In this paper, we propose a novel framework for text-based person retrieval, termed Object-Centric Discriminative Learning (OCDL), which incorporates person masks to indicate attentive regions, thereby enhancing the model's focus on the pedestrians in images while suppressing background noise. Additionally, a novel cross-modal matching loss, namely Soft Angular Distribution Matching (SADM), is introduced to learn discriminative visual and textual representations. Extensive experiments on three widely used TBPR datasets demonstrate the effectiveness of our approach. The code is available at https://github.com/JThuge/OCDL.
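The abstract does not give implementation details, so the following is only an illustrative sketch of the two ideas it names: using a binary person mask to suppress background pixels, and an angular (cosine-similarity) cross-modal matching objective. The mask application is a generic elementwise product, and the loss shown is a standard symmetric InfoNCE over L2-normalized embeddings, not the paper's actual SADM formulation; all function names here are hypothetical.

```python
import numpy as np

def apply_person_mask(image, mask):
    # Zero out background pixels so downstream features focus on the pedestrian.
    # image: (H, W, C) float array; mask: (H, W) with 1 = person, 0 = background.
    return image * mask[..., None]

def angular_matching_loss(img_emb, txt_emb, temperature=0.1):
    # Generic angular cross-modal matching: symmetric InfoNCE over cosine
    # similarities of L2-normalized embeddings. This stands in for, but is
    # NOT, the SADM loss, whose exact form is not given in the abstract.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # pairwise cosine similarities
    idx = np.arange(len(img))                # matched pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)            # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                      # diagonal = positives

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A matched image-text batch (identical embeddings) should yield a lower loss than a randomly mismatched one, which is the behavior a retrieval objective needs.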
External IDs:dblp:conf/icassp/LiLSZ25