Mind the Inconsistent Semantics in Positive Pairs: Semantic Aligning and Multimodal Contrastive Learning for Text-Based Pedestrian Search

Published: 01 Jan 2024 · Last Modified: 15 May 2025 · IEEE Trans. Inf. Forensics Secur. 2024 · CC BY-SA 4.0
Abstract: Text-Based Pedestrian Search (TBPS), which retrieves pedestrian images matching a textual description query, has gained attention for its role in public-security tasks such as suspect tracking. However, the modality discrepancy between textual descriptions and visual images makes it challenging to align semantic information across the two modalities. Moreover, a text description annotated on one pedestrian image may not match the content of other images of the same identity, owing to variations in viewpoint. Such text-image pairs with inconsistent semantics, termed weak positive pairs, measurably degrade model performance. To address these challenges, we propose a Semantic Aligning and Multimodal Contrastive learning (SAMC) model that captures cross-modality identity-invariant features through three modules: Multi-modality Features Fusion (MFF), Semantic-aligning Optimal Transport (SOT), and Multi-modality Contrastive Learning (MCL). First, MFF fuses textual and visual information and extracts identity-discriminative multimodal features using self- and cross-attention mechanisms; these multimodal features act as anchors that bridge the gap between the two modalities and enhance the identity-invariance of the unimodal features. Second, SOT addresses the semantic misalignment between textual descriptions and visual images: building on Optimal Transport (OT) theory, it encourages high feature similarity between positive samples from different modalities, thereby uncovering semantic relationships between image and text data without extra supervised labels. Lastly, MCL narrows the modality gap by pulling the two unimodal features toward the identity-discriminative multimodal features through contrastive learning. Different temperature coefficients are applied to strong and weak positive pairs to mitigate the inconsistency in text-image pair correlation. The effectiveness of SAMC is validated by comprehensive experiments on three TBPS datasets.
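To give a concrete sense of the OT-based alignment that SOT builds on, the sketch below runs entropy-regularized optimal transport (Sinkhorn iterations) on a cost matrix between image and text features; the resulting transport plan can be read as soft cross-modal correspondences. This is a generic illustration of the underlying OT machinery, not the paper's actual implementation, and all names and hyperparameters (`eps`, `n_iters`) are assumptions.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT via Sinkhorn iterations (illustrative).

    cost: (n, m) pairwise cost between image and text features.
    Returns an (n, m) transport plan with (approximately) uniform
    marginals; larger entries indicate stronger soft correspondence.
    """
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)        # scale rows toward marginal a
        v = b / (K.T @ u)      # scale columns toward marginal b
    return np.diag(u) @ K @ np.diag(v)

# Toy example: cosine-distance cost between random unit features.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = rng.normal(size=(6, 8))
img /= np.linalg.norm(img, axis=1, keepdims=True)
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
plan = sinkhorn(1.0 - img @ txt.T)
```

Because no extra supervision enters the iterations, the plan is driven purely by feature similarity, which matches the abstract's claim that SOT explores semantic relationships without additional labels.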
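The dual-temperature idea for strong versus weak positive pairs can be sketched as an InfoNCE-style loss where each image-text pair gets its own temperature: weak positives receive a larger temperature so they are pulled together less aggressively. The function name, temperature values, and batch layout below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dual_temp_infonce(img_feats, txt_feats, is_weak,
                      tau_strong=0.02, tau_weak=0.05):
    """Image-to-text contrastive loss with per-pair temperatures (sketch).

    img_feats, txt_feats: (B, D) features; row i of each is a positive pair.
    is_weak: (B,) bool mask marking weak positive pairs, which get the
    larger temperature tau_weak and hence softer gradients.
    """
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    sims = img @ txt.T                               # (B, B) cosine sims
    taus = np.where(is_weak, tau_weak, tau_strong)   # per-row temperature
    logits = sims / taus[:, None]
    # cross-entropy with the matched (diagonal) pair as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(1)
a = rng.normal(size=(8, 16))
b = rng.normal(size=(8, 16))
weak = np.array([False] * 4 + [True] * 4)
loss_random = dual_temp_infonce(a, b, weak)   # unaligned features
loss_aligned = dual_temp_infonce(a, a, weak)  # perfectly aligned features
```

Raising the temperature flattens the softmax for weak pairs, so their inconsistent semantics contribute smaller gradients than strong, fully matched pairs.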