Abstract: Bridging the disparity in description granularity and the information gap between images and text has long been a formidable challenge in text-based person retrieval (TBPR). Recent work has attempted to solve this problem through random local alignment, but such methods fail to capture the fine-grained relationships between images and text, so the information and modality gaps persist. We instead align image regions and text phrases at the same semantic granularity to address this semantic atomicity gap. Our idea is to first extract and then exploit the relationships between fine-grained local features. We introduce Fine-grained Semantic Alignment with Transferred Person-SAM (SAP-SAM), a novel approach in which, by distilling and transferring knowledge, a Person-SAM model extracts fine-grained semantic concepts at the same granularity from TBPR images and texts, together with their relationships. With this extracted knowledge, we optimize fine-grained matching via Explicit Local Concept Alignment and Attentive Cross-modal Decoding, which discriminate fine-grained image and text features at the same granularity level and represent the important semantic concepts of both modalities, effectively alleviating the granularity and information gaps. Evaluations on three popular TBPR datasets demonstrate that SAP-SAM achieves state-of-the-art results, underscoring the effectiveness of end-to-end fine-grained local alignment in TBPR.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: Deep learning methods are trending toward fine-grained multimodality, so aligning the many detailed semantics present in multimodal information is a problem researchers must confront. Our approach focuses on text-based pedestrian retrieval, a task closely tied to multimodal processing: given a text description, the goal is to retrieve the corresponding pedestrian from an image gallery. Based on the characteristics of this task, we propose a new fine-grained relationship mining method. We design Person-SAM to extract local correspondences between text phrases and fine-grained image regions, and we train it via knowledge distillation and transfer. We are then the first to exploit these relationships, improving the model's understanding of fine-grained features through matching of local detail features against region masks. Our approach is the first attempt to simultaneously align fine-grained details across modalities. We shed light on this problem and hope to inspire future research that achieves this goal more efficiently and extends it to more modalities and fine-grained tasks.
Supplementary Material: zip
Submission Number: 4623