Abstract: Person search by natural language description aims to retrieve the most relevant person in an image gallery according to a given textual description. The task is challenging due to the cross-domain and cross-modality gaps. Previous methods align local visual-textual features based on a global matching score while failing to capture the fine-grained cross-modal correspondence between image and text. In this paper, we propose a novel framework named Enhanced Attributes Alignment based on Semantic Co-Attention (EAA-SCA) for text-based person search. The proposed SCA consists of Self-Attention (SA) modules and Relationships Attention (RA) modules cascaded in depth. The SA module takes visual attribute features and textual features as input, respectively, to learn the internal dependencies within each modality. The self-attended visual attribute features and self-attended textual features are then fed into the RA module to learn finer-grained visual attribute features that are rich in semantic relationships between visual attributes and the textual description, which contributes to more precise attribute alignment. Experimental results on the CUHK-PEDES dataset demonstrate the effectiveness of the proposed method. With the assistance of SCA, Rank-1 accuracy in text-based person search improves by 2.43%.
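The cascade described in the abstract (per-modality self-attention followed by cross-modal relationship attention) can be sketched as follows. This is a minimal, hypothetical illustration of the general co-attention pattern, not the paper's implementation: the function names, feature dimensions, and use of plain scaled dot-product attention are all assumptions.

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: queries q attend to keys k, aggregate values v."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v

def sca_block(vis, txt):
    """One hypothetical SCA stage: SA on each modality, then RA across modalities.

    vis: (num_attributes, dim) visual attribute features
    txt: (num_words, dim) textual features
    """
    # SA modules: each modality attends to itself to model internal dependencies
    vis = attention(vis, vis, vis)
    txt = attention(txt, txt, txt)
    # RA module: self-attended visual attributes query the self-attended text,
    # injecting cross-modal semantic relationships into the attribute features
    vis = attention(vis, txt, txt)
    return vis, txt

# Stages "cascaded in depth" would simply apply sca_block repeatedly:
vis = np.random.randn(6, 64)   # 6 visual attributes, 64-d features (assumed sizes)
txt = np.random.randn(20, 64)  # 20 words
for _ in range(2):
    vis, txt = sca_block(vis, txt)
```

The refined `vis` features would then feed the attribute-alignment matching described in the abstract; a real implementation would add learned projections, multi-head attention, and residual connections.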
External IDs: dblp:conf/cicai/WangH21