Cross-Modal Semantic Alignment Learning for Text-Based Person Search

Published: 01 Jan 2024 · Last Modified: 11 Jun 2024 · MMM (1) 2024 · License: CC BY-SA 4.0
Abstract: Text-based person search aims to retrieve pedestrian images corresponding to a specific identity based on a textual description. Existing methods primarily focus on either the alignment of global features through well-designed loss functions or the alignment of local features via attention mechanisms. However, these approaches overlook the extraction of crucial local cues and incur high computational costs when computing cross-modal similarity scores. To address these limitations, we propose a novel Cross-Modal Semantic Alignment Learning approach (SAL), which effectively facilitates the learning of discriminative representations with efficient and accurate cross-modal alignment. Specifically, we devise a Token Clustering Learning module that mines crucial cues by clustering the visual and textual token features extracted from the backbone into fine-grained compact part prototypes, each corresponding to a specific identity-related discriminative semantic concept. Furthermore, we introduce an optimal transport strategy that explicitly encourages fine-grained semantic alignment between image and text part prototypes, achieving efficient and accurate cross-modal matching while largely reducing computational costs. Extensive experiments on two public datasets demonstrate the effectiveness and superiority of SAL for text-based person search.
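To make the two components concrete, below is a minimal PyTorch sketch of (i) soft-clustering backbone token features into K compact part prototypes and (ii) aligning image and text prototypes with an entropy-regularized (Sinkhorn) optimal transport plan. All names, shapes, and hyperparameters here (`cluster_tokens_to_prototypes`, `sinkhorn`, `eps`, the shared prototype queries, K = 6) are illustrative assumptions, not the authors' implementation.

```python
import math
import torch
import torch.nn.functional as F


def cluster_tokens_to_prototypes(tokens: torch.Tensor,
                                 prototypes: torch.Tensor) -> torch.Tensor:
    """Soft-cluster backbone token features into K compact part prototypes.

    tokens:     (B, N, D) visual or textual token features from the backbone
    prototypes: (K, D)    learnable part-prototype queries (assumed)
    returns:    (B, K, D) per-sample part features
    """
    # Cosine affinity between every token and every prototype.
    affinity = torch.einsum('bnd,kd->bnk',
                            F.normalize(tokens, dim=-1),
                            F.normalize(prototypes, dim=-1))
    # Soft assignment over tokens: each prototype aggregates "its" tokens.
    assign = affinity.softmax(dim=1)
    return torch.einsum('bnk,bnd->bkd', assign, tokens)


def sinkhorn(cost: torch.Tensor, eps: float = 0.05,
             n_iters: int = 10) -> torch.Tensor:
    """Entropy-regularized OT plan between uniform marginals (log domain)."""
    k1, k2 = cost.shape
    log_a = cost.new_full((k1,), -math.log(k1))  # uniform image-part mass
    log_b = cost.new_full((k2,), -math.log(k2))  # uniform text-part mass
    log_kernel = -cost / eps
    log_u = torch.zeros_like(log_a)
    log_v = torch.zeros_like(log_b)
    for _ in range(n_iters):
        log_u = log_a - torch.logsumexp(log_kernel + log_v[None, :], dim=1)
        log_v = log_b - torch.logsumexp(log_kernel + log_u[:, None], dim=0)
    # Transport plan T with (approximately) the prescribed marginals.
    return torch.exp(log_kernel + log_u[:, None] + log_v[None, :])


def ot_alignment_score(img_parts: torch.Tensor,
                       txt_parts: torch.Tensor) -> torch.Tensor:
    """Match one image's part prototypes against one caption's prototypes."""
    img = F.normalize(img_parts, dim=-1)  # (K, D)
    txt = F.normalize(txt_parts, dim=-1)  # (K, D)
    sim = img @ txt.t()                   # (K, K) part-to-part cosine similarity
    plan = sinkhorn(1.0 - sim)            # cost = 1 - similarity
    return (plan * sim).sum()             # OT-weighted similarity score


# Toy usage with assumed sizes: ViT-like patch tokens, BERT-like word tokens.
img_tokens = torch.randn(2, 196, 256)
txt_tokens = torch.randn(2, 64, 256)
protos = torch.nn.Parameter(torch.randn(6, 256))  # shared prototypes (assumption)
img_parts = cluster_tokens_to_prototypes(img_tokens, protos)
txt_parts = cluster_tokens_to_prototypes(txt_tokens, protos)
score = ot_alignment_score(img_parts[0], txt_parts[0])
```

One plausible reading of the efficiency claim: matching operates on K part prototypes per modality rather than on all N visual and M textual tokens, so the pairwise similarity computation shrinks from O(NM) to O(K²) comparisons per image-text pair.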