Cross-modal alignment with synthetic caption for text-based person search

Published: 01 Jan 2025, Last Modified: 26 Jul 2025. Int. J. Multim. Inf. Retr. 2025. License: CC BY-SA 4.0.
Abstract: Text-based person search aims to retrieve a target person from a large gallery based on a natural language description. Existing methods treat it as either a one-to-one or a many-to-many embedding matching problem. The former relies on the assumption that strong alignment exists between text and images, while the latter inevitably leads to issues of intra-class variation. Rather than being confined to these two approaches, we propose a new strategy that achieves cross-modal alignment with synthetic captions for joint image-text-caption optimization, named CASC. The core of this strategy lies in generating fine-grained captions that are informative for multimodal alignment. To realize this, we introduce two novel components: a Granularity Awareness Sensor (GAS) and Conditional Contrastive Learning (CCL). GAS selects relevant features through an innovative adaptive masking strategy, endowing the model with an enhanced perception of discriminative features. CCL aligns the different modalities by imposing further constraints on the synthetic captions, comparing the similarity of hard negative samples to guard against disruption from noisy content. With the incorporation of extra caption supervision, the model learns more comprehensive feature representations, which in turn boosts retrieval performance during inference. Experiments demonstrate that CASC outperforms existing state-of-the-art methods by 1.20%, 2.35%, and 2.29% in terms of Rank@1 on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively.
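The abstract gives no formal definitions for GAS or CCL, so the following is a minimal, hypothetical PyTorch sketch of the two ideas as described: an adaptive mask that keeps only salient feature tokens, and an InfoNCE-style contrastive loss that discards negatives scoring suspiciously close to the positive pair (plausibly noisy synthetic captions). All names, thresholds, and hyperparameters (`adaptive_feature_mask`, `conditional_contrastive_loss`, `tau`, `margin`) are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F


def adaptive_feature_mask(tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """GAS-style adaptive masking (hypothetical): keep only tokens whose
    saliency score exceeds a per-sample adaptive threshold, zeroing the rest.

    tokens: (B, N, D) feature tokens; scores: (B, N) saliency scores.
    """
    thresh = scores.mean(dim=1, keepdim=True)        # adaptive threshold per sample
    mask = (scores > thresh).unsqueeze(-1).float()   # (B, N, 1) binary keep-mask
    return tokens * mask


def conditional_contrastive_loss(img_emb: torch.Tensor,
                                 cap_emb: torch.Tensor,
                                 tau: float = 0.07,
                                 margin: float = 2.0) -> torch.Tensor:
    """CCL-style loss (hypothetical): InfoNCE over image-caption pairs,
    where negatives whose similarity comes within `margin` logits of the
    positive are masked out as potentially noisy synthetic captions.
    """
    img = F.normalize(img_emb, dim=-1)
    cap = F.normalize(cap_emb, dim=-1)
    sim = img @ cap.t() / tau                        # (B, B) similarity logits
    pos = sim.diag().unsqueeze(1)                    # (B, 1) positive-pair logits

    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    keep = (sim < pos - margin) & ~eye               # retain only "safe" negatives

    logits = sim.masked_fill(~(keep | eye), float('-inf'))
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(logits, labels)


# Toy usage: random embeddings for a batch of 8 image-caption pairs.
loss = conditional_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```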