Abstract: In recent years, Vision-Language Pre-training (VLP) models have demonstrated rich prior knowledge for multimodal alignment, prompting investigations into their application to Specific Domain Image-Text Retrieval (SDITR) tasks such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). Owing to the unique data characteristics of these scenarios, the primary challenge is to leverage discriminative fine-grained local information to better map images and text into a shared space. Current approaches perform alignment by interacting over all multimodal local features, relying only implicitly on discriminative local information to distinguish data differences, which may introduce noise and uncertainty. Furthermore, their VLP feature extractors, such as CLIP, often focus on instance-level representations, potentially reducing the discriminability of fine-grained local features. To alleviate these issues, we propose an Explicit Key Local information Selection and Reconstruction Framework (EKLSR), which explicitly selects key local information to enhance feature representation. Specifically, we first introduce a Key Local information Selection and Fusion (KLSF) module that utilizes hidden knowledge from the VLP model to interpretably select and fuse key local information. Second, we employ Key Local segment Reconstruction (KLR) based on multimodal interaction to reconstruct the key local segments of images (texts), significantly enriching their discriminative information and enhancing both inter-modal and intra-modal alignment. To demonstrate the effectiveness of our approach, we conducted experiments on five datasets across TIReID and RSITR. Notably, our EKLSR model achieves state-of-the-art performance on two RSITR datasets.
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: This work contributes to the field of multimedia/multimodal processing by exploring the application of Vision-Language Pre-training (VLP) models to Specific Domain Image-Text Retrieval (SDITR) tasks, such as Text-Image Person Re-identification (TIReID) and Remote Sensing Image-Text Retrieval (RSITR). The primary contribution lies in addressing the challenge of leveraging discriminative fine-grained local information to improve the alignment of images and text in a shared space.
Specifically, the proposed Explicit Key Local Information Selection and Reconstruction Framework (EKLSR) tackles this challenge by explicitly selecting key discriminative local information to enhance feature representation. It introduces a Key Local Information Selection and Fusion (KLSF) method that leverages hidden knowledge from the VLP model to select and fuse key local information. This process strengthens the final feature representation while avoiding the introduction of noise and uncertainty.
Additionally, the Key Local Segment Reconstruction (KLR) technique based on multimodal interaction is employed to reconstruct key local segments of images and texts. This significantly enriches the discriminative information of the local features and enhances both inter-modal and intra-modal alignment.
Furthermore, the proposed model adopts a dual-stream inference framework, making it practical to deploy in specific-domain scenarios.
Experimental results on six datasets across TIReID and RSITR demonstrate the effectiveness of the proposed approach, showing significant improvements in performance.
Supplementary Material: zip
Submission Number: 3899