Keywords: Text-based Person Search, Vision-Language Alignment, Semantic Gap, Modality Gap, Contrastive Learning, Noise Robustness
Abstract: Text-based Person Search (TBPS) faces two critical and intertwined challenges: the semantic gap caused by noisy image-text correspondences, and the modality gap stemming from the structural heterogeneity between dense visual features and sparse textual attributes. Crucially, we identify that this heterogeneity leads to intra-modal disorganization, where embeddings—particularly sparse text representations—lack internal structure, hindering robust alignment. To address these limitations, we propose Dual-Gaps Robust Aligned General Embedding (DRAGE), a unified framework comprising two synergistic mechanisms. First, Semantic Embedding Reliability Division (SERD) dynamically partitions training data into reliable and noisy subsets using a Beta Mixture Model, providing clean supervision to mitigate the semantic gap. Second, Semantic Intra-Modality Regularization (SIMR) explicitly addresses intra-modal disorganization by enforcing semantic-aware distance constraints within each modality before cross-modal alignment. This transforms disorganized embeddings into coherent semantic clusters, establishing a stable geometric foundation for matching. Extensive experiments on CUHK-PEDES, ICFG-PEDES, and RSTPReid demonstrate that DRAGE significantly outperforms state-of-the-art methods, achieving 78.58% Rank-1 accuracy on CUHK-PEDES and exhibiting superior robustness in cross-domain settings.
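The SERD mechanism described above can be illustrated with a minimal sketch. Assumptions not stated in the abstract: per-sample matching losses are normalized to (0, 1), clean pairs concentrate at low loss and noisy pairs at high loss, a two-component Beta Mixture Model is fitted by EM with method-of-moments updates, and a posterior threshold of 0.5 splits the data; the synthetic losses and all function names here are hypothetical, not the paper's implementation.

```python
# Hedged sketch of a SERD-style reliability division: fit a 2-component
# Beta Mixture Model to normalized per-sample losses, then keep samples
# that the low-loss ("reliable") component claims with posterior > 0.5.
import numpy as np
from math import lgamma

def beta_pdf(x, a, b):
    # Beta density via log-gamma, avoiding external stats dependencies.
    log_B = lgamma(a) + lgamma(b) - lgamma(a + b)
    return np.exp((a - 1) * np.log(x) + (b - 1) * np.log(1 - x) - log_B)

rng = np.random.default_rng(0)
# Synthetic normalized losses: 80% "clean" (low loss), 20% "noisy" (high loss).
losses = np.concatenate([rng.beta(2, 8, 800), rng.beta(8, 2, 200)])
losses = np.clip(losses, 1e-4, 1 - 1e-4)

w = np.array([0.5, 0.5])            # mixture weights
params = [(2.0, 5.0), (5.0, 2.0)]   # (alpha, beta) per component
for _ in range(50):
    # E-step: posterior responsibility of each component for each sample.
    pdfs = np.stack([w[k] * beta_pdf(losses, *params[k]) for k in range(2)])
    resp = pdfs / pdfs.sum(axis=0, keepdims=True)
    # M-step: mixture weights, then method-of-moments Beta parameters.
    w = resp.mean(axis=1)
    params = []
    for k in range(2):
        m = np.average(losses, weights=resp[k])
        v = np.average((losses - m) ** 2, weights=resp[k])
        common = m * (1 - m) / v - 1
        params.append((m * common, (1 - m) * common))

# The component with the smaller mean loss is treated as "reliable".
means = [a / (a + b) for a, b in params]
clean_k = int(np.argmin(means))
reliable = resp[clean_k] > 0.5      # boolean mask over the training set
```

The `reliable` mask would then gate which pairs contribute clean supervision; the noisy subset could be down-weighted or relabeled rather than discarded, a design choice the abstract leaves open.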
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: Vision and Language, Retrieval
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 1889