Abstract: Retrieving specific person images from textual descriptions, known as Text-to-Image Person Retrieval (TIPR), has emerged as a challenging research problem. While existing methods primarily focus on architectural refinements and feature representation enhancements, the critical aspect of textual description quality remains understudied. We propose a novel framework that automatically generates stylistically consistent textual descriptions to enhance TIPR generalizability. Specifically, we develop a dual-model architecture that employs both a captioning model and a retrieval model to quantitatively evaluate the impact of textual descriptions on retrieval performance. Comparative analysis reveals that manually annotated descriptions exhibit significant stylistic variation due to the subjective biases of different annotators. To address this, our framework uses the captioning model to generate structurally consistent textual descriptions, enabling subsequent training and inference of the retrieval model on automated annotations. Notably, our framework achieves an 18.60% improvement in Rank-1 accuracy over manual annotations on the RSTPReid dataset. We systematically investigate the impact of identity quantity during testing and explore a prompt-guided strategy to enhance image caption quality. Furthermore, this paradigm ensures superior generalization for well-trained retrieval models. Extensive experiments demonstrate that our approach improves the applicability of TIPR systems.