Abstract: Text-Based Person Search (TBPS), which aims to retrieve target pedestrian images using natural language descriptions, has garnered significant attention in multimedia research due to its potential in suspect retrieval and missing person identification. While supervised and weakly supervised methods rely on costly annotated training data, unsupervised TBPS eliminates the need for textual descriptions or identity annotations, presenting a more practical paradigm. Current unsupervised TBPS approaches face two primary challenges: 1) predefined attribute templates for caption generation limit linguistic diversity and real-world adaptability, and 2) threshold-based sample selection using pre-trained vision-language models (VLMs) introduces noisy pairs due to inadequate pedestrian-specific representation. To address these limitations, we propose FACE, a unified framework featuring Dual-template Caption Generation (DCG) and Adaptive Curriculum Training (ACT). The DCG module generates high-quality captions through complementary flexible-style (natural language) and fixed-style (attribute-enumerated) templates, enhanced by LLM-based noise filtering. The ACT framework progressively refines training through a self-improving loop: initial high-confidence sample selection using VLMs bootstraps the model, while evolving feature representations enable the dynamic incorporation of harder samples through curriculum learning. This dual strategy achieves mutual reinforcement between caption quality and model discriminability. Extensive experiments on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets under unsupervised settings demonstrate that our framework achieves state-of-the-art performance.
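The curriculum idea behind ACT (start from high-confidence VLM-scored pairs, then gradually admit harder ones) can be illustrated with a minimal sketch. All names, the linear threshold schedule, and the score source are illustrative assumptions, not the paper's actual ACT procedure:

```python
import numpy as np

def select_curriculum_pairs(similarities, epoch, start_thresh=0.9,
                            end_thresh=0.6, total_epochs=10):
    """Hypothetical curriculum-style pair selection.

    `similarities` holds image-caption matching scores (e.g. cosine
    similarity from a pre-trained vision-language model). Early epochs
    keep only high-confidence pairs; the threshold is relaxed linearly
    so harder samples enter training later.
    """
    # Fraction of the curriculum completed, clipped to [0, 1]
    t = min(epoch / max(total_epochs - 1, 1), 1.0)
    thresh = start_thresh + t * (end_thresh - start_thresh)
    selected = np.where(similarities >= thresh)[0]
    return selected, thresh

# Example: VLM scores for 6 candidate image-caption pairs
scores = np.array([0.95, 0.88, 0.72, 0.65, 0.55, 0.91])
easy, th0 = select_curriculum_pairs(scores, epoch=0)  # strict: keeps 2 pairs
late, th9 = select_curriculum_pairs(scores, epoch=9)  # relaxed: keeps 5 pairs
```

In the paper's self-improving loop, the scores themselves would also be refreshed as the model's pedestrian-specific representations improve; here they are held fixed for brevity.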
External IDs: doi:10.1145/3746027.3755315