Abstract: Text-to-image person retrieval aims to identify a desired individual based on a textual description. As an instance-level retrieval problem, it exhibits large intra-class variance and small inter-class variance. Although significant progress has been made, the omni-granularity matching issue remains unaddressed. Omni-granularity matching involves aligning words with image regions of multiple granularities, which challenges models to learn an omni-granularity embedding space. In this paper, we introduce a novel Omni-Granularity Embedding Network (OGEN) for person representation learning. It addresses the omni-granularity matching issue with a Cross-Granularity Aggregation Module (CGAM), which dynamically consolidates features of diverse granularities to learn both granularity-dependent and omni-granularity person representations. Additionally, a teacher-student knowledge transfer framework is introduced to minimize the inter-modality discrepancy, allowing CGAM to focus on modality-shared semantics. Owing to the effectiveness of CGAM and the knowledge transfer framework, our OGEN improves the Rank-1 accuracy of the Baseline by 8.54%, 9.89%, and 11.09% on three public datasets, respectively.