Semantic Entity Alignment and Non-Corresponding Reasoning for Text-to-Image Person Re-Identification
Abstract: With the rapid development of intelligent surveillance technology, the massive amount of multimodal data (e.g., videos, images, and text) has imposed higher demands on efficient information retrieval and security. Traditional single-modal retrieval methods struggle to meet practical requirements, making multimodal image-text retrieval a research hotspot in this field. Existing approaches, however, still face challenges in fine-grained semantic alignment and suffer from rigid matching mechanisms. To address these issues, this paper introduces SeaNcr, a novel framework that integrates cross-modal semantic entity alignment with non-correspondence reasoning. Our method constructs class-level entity representations enhanced by saliency-guided masking to capture discriminative semantic features. A pseudo-frozen asynchronous optimization strategy is introduced to maintain semantic consistency across modalities by associating stable entity representations with dynamically updated encoder features. Moreover, to overcome rigid matching, we design a non-correspondence reasoning module that jointly leverages intra-modal similarity and cross-modal mutual nearest neighbor constraints, optimizing matching flexibility and generalization. Extensive experiments validate that SeaNcr significantly enhances cross-modal feature representation and retrieval robustness, achieving state-of-the-art performance on multiple person re-identification benchmarks.
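Illustrative sketch: the abstract names a cross-modal mutual nearest neighbor constraint but gives no implementation details. The following minimal Python example shows only the general idea of mutual nearest neighbor matching between text and image features. It is not the SeaNcr module itself: the function names `cosine` and `mutual_nearest_neighbors`, the use of cosine similarity, and the toy feature vectors are all assumptions made for illustration, and the intra-modal similarity term mentioned in the abstract is not modeled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mutual_nearest_neighbors(text_feats, image_feats):
    """Return (text_index, image_index) pairs that are each other's nearest neighbor.

    Hypothetical helper: a pair is kept only if the image is the most similar
    image for the text AND the text is the most similar text for the image.
    """
    # Cross-modal similarity matrix: rows are texts, columns are images.
    sim = [[cosine(t, v) for v in image_feats] for t in text_feats]

    # Nearest image for every text (row-wise argmax).
    best_image = [
        max(range(len(image_feats)), key=lambda j: row[j]) for row in sim
    ]
    # Nearest text for every image (column-wise argmax).
    best_text = [
        max(range(len(text_feats)), key=lambda i: sim[i][j])
        for j in range(len(image_feats))
    ]

    # Keep only pairs where the two directions agree.
    return [(i, j) for i, j in enumerate(best_image) if best_text[j] == i]


# Toy features (assumed values, not taken from the paper).
text_feats = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
image_feats = [[1.0, 0.1], [0.1, 1.0]]
pairs = mutual_nearest_neighbors(text_feats, image_feats)
```

In this toy case, texts 0 and 2 both prefer image 0, but image 0 prefers text 2, so text 0 is left unmatched. A rigid one-directional matcher would instead have forced text 0 onto image 0; requiring agreement in both directions is one simple way to allow some samples to have no counterpart.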
External IDs: dblp:journals/tifs/PengCLSC26