Abstract: Text-based person search aims to retrieve, from person databases, images of the person that match a given text description, and it has attracted considerable attention from both academia and industry in recent years. However, the task faces two challenges simultaneously: fine-grained retrieval and the heterogeneous gap between images and text. Some methods use supervised attribute learning to extract attribute-related features and thereby associate images and text at a fine-grained level. However, attribute labels are difficult to obtain, which leads to poor performance of such approaches in practice. Extracting attribute-related features without attribute annotations and establishing fine-grained cross-modal semantic associations thus becomes a critical problem. To solve this problem, this study proposes a text-based person search method based on virtual attribute learning, which incorporates pre-training techniques to establish fine-grained cross-modal semantic associations through unsupervised attribute learning. First, based on the invariance of person attributes and cross-modal semantic consistency, a semantic-guided attribute decoupling method is proposed, which uses person identity labels as a supervisory signal to guide the model to decouple attribute-related features. Second, a feature learning module based on semantic inference is proposed, which constructs a semantic graph from the associations between attributes and enhances the cross-modal recognition capability of features by exchanging information among attributes through a graph model. Experiments are conducted on the public text-based person search dataset CUHK-PEDES and the cross-modal retrieval dataset Flickr30k, and the proposed method is compared with existing methods. The experimental results demonstrate the effectiveness of the proposed method.
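To make the two modules described above concrete, the following is a minimal PyTorch sketch, not the paper's actual implementation. It illustrates (1) decoupling attribute-related features with learnable "virtual attribute" queries supervised only by identity labels, and (2) exchanging information among the decoupled attribute features over a semantic graph. All names (`VirtualAttributeDecoupler`, `AttributeGraphLayer`, `num_attrs`, `num_ids`) and design choices (attention-based decoupling, a similarity-derived soft adjacency) are illustrative assumptions; the paper's architecture and losses may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VirtualAttributeDecoupler(nn.Module):
    """Decouples encoder tokens into K attribute-related features using
    learnable attribute queries; person identity labels supervise the
    decoupling, so no attribute annotations are required (hypothetical)."""

    def __init__(self, feat_dim=512, num_attrs=8, num_ids=1000):
        # num_ids is a placeholder; set it to the number of person
        # identities in the training set.
        super().__init__()
        # Learnable "virtual attribute" prototypes.
        self.attr_queries = nn.Parameter(torch.randn(num_attrs, feat_dim))
        self.id_head = nn.Linear(num_attrs * feat_dim, num_ids)

    def forward(self, tokens):
        # tokens: (B, N, D) patch/word features from an image or text encoder.
        attn = torch.einsum('kd,bnd->bkn', self.attr_queries, tokens)
        attn = attn.softmax(dim=-1)
        # Aggregate tokens into one feature per virtual attribute: (B, K, D).
        attr_feats = torch.einsum('bkn,bnd->bkd', attn, tokens)
        # Identity classification provides the supervisory signal.
        id_logits = self.id_head(attr_feats.flatten(1))
        return attr_feats, id_logits


class AttributeGraphLayer(nn.Module):
    """Exchanges information among attribute features over a semantic graph
    whose soft edges are inferred from pairwise attribute similarity."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, attr_feats):
        # attr_feats: (B, K, D); build a soft adjacency from cosine similarity.
        x = F.normalize(attr_feats, dim=-1)
        adj = (x @ x.transpose(1, 2)).softmax(dim=-1)  # (B, K, K)
        # One round of message passing with a residual connection.
        return attr_feats + self.proj(adj @ attr_feats)


if __name__ == '__main__':
    tokens = torch.randn(4, 196, 512)       # e.g., ViT patch features
    ids = torch.randint(0, 1000, (4,))      # person identity labels
    decoupler = VirtualAttributeDecoupler()
    graph = AttributeGraphLayer()
    attr_feats, id_logits = decoupler(tokens)
    attr_feats = graph(attr_feats)
    loss = F.cross_entropy(id_logits, ids)  # identity-supervised objective
    print(attr_feats.shape, loss.item())
```

In this sketch the same two modules could be applied to both the image and the text branch so that matched pairs yield aligned attribute features; how the cross-modal alignment loss is defined is left out here, as the abstract does not specify it.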