Multi-granularity relation-aware and conditional query learning for text-based person search

Published: 01 Jan 2025, Last Modified: 15 May 2025, J. Electronic Imaging 2025, CC BY-SA 4.0
Abstract: Text-based person search (TBPS) aims to retrieve images of a specific person that match a given natural language query. It plays a significant role in intelligent monitoring systems because its query format is open and convenient. The core of TBPS is to extract multi-modal features with different encoders so as to capture the latent relationships between image parts and the relevant words. However, existing methods typically use independent encoders to learn features from each modality and explore only global similarities, which introduces noise from similar textual expressions. These weak positive samples (similar textual sentences) require more refined relation inference for cross-modal retrieval. In addition, previous works have focused on matching different regions of person images, which under-utilizes the more discriminative conditional information in language queries. To address these problems, we propose a multi-granularity relation-aware and conditional query learning network for TBPS, which extracts relation-aware features with conditional language queries in a multi-granularity manner. Specifically, the conditional query mechanism is designed to make full use of the conditional information in language queries. Meanwhile, the multi-granularity relation-aware network extracts features from different modalities and explores strong relationships between words and images. Comparative experiments against existing works on three publicly available datasets demonstrate the effectiveness and superiority of the proposed network.