CLIO: A Unified Framework for Consistency-Aware Learning and Intra-Modal Optimization in Text-Based Person Re-identification
Abstract: Text-based person retrieval aims to locate a target individual in a large-scale image gallery given only a textual description. Existing methods rely heavily on high-quality cross-modal annotations. In real-world scenarios, however, annotations typically come from either manual labeling or large-scale language models, both of which inevitably introduce noise. Such noise can cause semantic confusion between weakly aligned positive pairs and visually similar negative pairs. To address this challenge, we present CLIO, a novel framework that combines consistency refinement and mining with intra-modal optimization to handle noisy correspondences in annotated data. We first introduce a Consistency Refinement and Mining (CRAM) module, which models intra-modal feature consistency across augmented views to estimate the reliability of each sample and to distinguish true correspondences from noisy ones. In addition, to alleviate modality-specific representation degradation, we design a Cross-modal and Uni-modal Implicit Label Alignment (CUIA) module, which propagates confidence-aware labels across modality-specific and shared embedding spaces to enhance cross-modal alignment. Extensive experiments demonstrate that CLIO consistently achieves superior performance under noisy correspondence conditions, attaining a Rank-1 accuracy of 73.85% on the CUHK-PEDES dataset, with mINP and mAP improvements of 1.63% and 0.68%, respectively.
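The idea of scoring sample reliability from the agreement between augmented views can be illustrated with a minimal sketch. This is not the authors' CRAM formulation, which the abstract does not specify. The function names (`cosine`, `consistency_score`, `split_by_reliability`), the use of cosine similarity, the averaging of the two modalities, and the fixed threshold are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def consistency_score(img_views, txt_views):
    """Hypothetical reliability score for one image-text pair.

    img_views and txt_views each hold the features of two augmented
    views of the same sample. The score averages the intra-modal
    agreement of the two views in each modality (an assumption, not
    the paper's definition).
    """
    img_agree = cosine(img_views[0], img_views[1])
    txt_agree = cosine(txt_views[0], txt_views[1])
    return 0.5 * (img_agree + txt_agree)

def split_by_reliability(scores, threshold=0.5):
    """Split sample indices into (clean, noisy) using an assumed fixed threshold."""
    clean = [i for i, s in enumerate(scores) if s >= threshold]
    noisy = [i for i, s in enumerate(scores) if s < threshold]
    return clean, noisy
```

In this sketch, a pair whose augmented views agree in both modalities receives a score near 1 and is treated as a likely true correspondence, while a pair with low agreement is flagged as possibly noisy. A real system would likely learn or adapt the threshold rather than fix it.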
External IDs: dblp:conf/icic/YuanXZHGHL25