Learning Semantic-Aligned Feature Representation for Text-based Person Search

10 Nov 2022 · OpenReview Archive Direct Upload
Abstract: Text-based person search aims to retrieve images of a certain pedestrian given a textual description. The key challenge of this task is to eliminate the inter-modality gap and achieve feature alignment across modalities. In this paper, we propose a semantic-aligned embedding method for text-based person search, in which feature alignment across modalities is achieved by automatically learning semantic-aligned visual and textual features. First, we introduce two Transformer-based backbones to encode robust feature representations of the images and texts. Second, we design a semantic-aligned feature aggregation network that adaptively selects and aggregates features with the same semantics into part-aware features; this is achieved by a multi-head attention module constrained by a cross-modality part alignment loss and a diversity loss. Experimental results on the CUHK-PEDES and Flickr30K datasets show that our method achieves state-of-the-art performance.
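The abstract gives no implementation details, so the following PyTorch sketch is only one plausible reading of the part-aware aggregation it describes: learnable part queries attend over backbone token features through multi-head attention, a diversity loss discourages the parts from collapsing onto the same semantics, and a cross-modality part alignment loss ties corresponding visual and textual parts together. All names (`PartAggregator`, `num_parts`) and the concrete loss forms (pairwise-cosine diversity, InfoNCE-style alignment) are assumptions for illustration, not the authors' code.

```python
# A minimal sketch of the part-aware aggregation described in the abstract.
# The number of parts K, the module names, and the exact loss forms are
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartAggregator(nn.Module):
    """Aggregates backbone token features into K part-aware features
    using learnable part queries and multi-head attention."""
    def __init__(self, dim=512, num_parts=6, num_heads=8):
        super().__init__()
        self.part_queries = nn.Parameter(torch.randn(num_parts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N, dim) from the visual or textual Transformer backbone
        B = tokens.size(0)
        queries = self.part_queries.unsqueeze(0).expand(B, -1, -1)  # (B, K, dim)
        parts, attn_weights = self.attn(queries, tokens, tokens)    # (B, K, dim)
        return parts, attn_weights

def diversity_loss(parts):
    # Encourage the K part features to capture different semantics by
    # penalizing pairwise cosine similarity (one plausible formulation).
    p = F.normalize(parts, dim=-1)                  # (B, K, dim)
    sim = torch.bmm(p, p.transpose(1, 2))           # (B, K, K)
    K = sim.size(1)
    off_diag = sim - torch.eye(K, device=sim.device)  # zero the diagonal
    return off_diag.pow(2).sum(dim=(1, 2)).mean() / (K * (K - 1))

def part_alignment_loss(visual_parts, text_parts, temperature=0.07):
    # Cross-modality part alignment: pull the k-th visual part toward the
    # k-th textual part of the matching pair (an InfoNCE-style choice).
    B, K, _ = visual_parts.shape
    v = F.normalize(visual_parts, dim=-1).reshape(B * K, -1)
    t = F.normalize(text_parts, dim=-1).reshape(B * K, -1)
    logits = v @ t.t() / temperature                # (B*K, B*K)
    targets = torch.arange(B * K, device=logits.device)
    return F.cross_entropy(logits, targets)
```

In this sketch one aggregator would be instantiated per modality (e.g., `visual_parts, _ = visual_aggregator(image_tokens)`), and the two loss terms added to the overall retrieval objective; how the paper weights or shares these components is not specified in the abstract.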