Abstract: Recent advances in vision-language pre-trained models like CLIP have greatly enhanced general domain image-text retrieval performance. This success has led scholars to develop methods for applying CLIP to Specific Domain Image-Text Retrieval (SDITR) tasks such as Remote Sensing Image-Text Retrieval (RSITR) and Text-Image Person Re-identification (TIReID). However, these methods for SDITR often neglect two critical aspects: the enhancement of modal-level distribution consistency within the retrieval space and the reduction of CLIP's computational cost during inference, resulting in suboptimal retrieval spaces and unnecessarily high inference computational loads.
To address these issues, this paper presents a novel framework based on the CLIP architecture: Accurate and Lightweight learning for specific domain Image-text Retrieval (AIR). AIR incorporates a Modal-Level distribution Consistency Enhancement regularization (MLCE) loss to improve retrieval precision and a Self-Pruning Distillation Strategy (SPDS) to improve computational efficiency. The MLCE loss harmonizes the sample distance distributions within the image and text modalities, bringing the retrieval space closer to its ideal state. Meanwhile, SPDS uses knowledge distillation to transfer CLIP's deep multimodal knowledge to shallower layers, so that only the essential layers are retained for inference, yielding a lighter model.
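The abstract does not give the exact form of either component, so the following is only a minimal sketch of the two ideas under stated assumptions, not the paper's formulation. It assumes the consistency term can be approximated by the mean squared difference between the intra-modal cosine-similarity matrices of paired image and text embeddings, and that self-pruning distillation can be approximated by matching a shallow layer's features to a deep layer's features and then keeping only the first few layers. All function names (`intra_modal_consistency`, `feature_distillation_loss`, `prune_layers`) are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine_matrix(embs):
    """Pairwise cosine similarities among the samples of one modality."""
    unit = [normalize(v) for v in embs]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def intra_modal_consistency(img_embs, txt_embs):
    """Assumed stand-in for an MLCE-style regularizer.

    Penalizes the difference between how samples are arranged relative to
    each other in the image modality and in the text modality. Row i of
    both inputs must describe the same image-text pair.
    """
    n = len(img_embs)
    if n != len(txt_embs) or n < 2:
        raise ValueError("need at least two paired samples")
    sim_img = cosine_matrix(img_embs)
    sim_txt = cosine_matrix(txt_embs)
    # Skip the diagonal: self-similarity is always 1 and carries no signal.
    diffs = [
        (sim_img[i][j] - sim_txt[i][j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    ]
    return sum(diffs) / len(diffs)


def feature_distillation_loss(shallow_feat, deep_feat):
    """Assumed distillation term: mean squared error between the features
    of a shallow layer (student) and a deep layer (teacher) of one model."""
    if len(shallow_feat) != len(deep_feat):
        raise ValueError("feature dimensions must match")
    return sum((s - d) ** 2 for s, d in zip(shallow_feat, deep_feat)) / len(deep_feat)


def prune_layers(layers, keep):
    """Keep only the first `keep` layers for inference."""
    return list(layers[:keep])


# Toy data: the "rotated" text embeddings are the image embeddings rotated
# by 90 degrees and scaled by 2, so their pairwise cosine similarities match.
imgs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
txts_rotated = [[0.0, 2.0], [-2.0, 0.0], [-2.0, 2.0]]
txts_different = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]

loss_consistent = intra_modal_consistency(imgs, txts_rotated)
loss_inconsistent = intra_modal_consistency(imgs, txts_different)
```

In this toy setup the consistency term is close to zero when the two modalities share the same internal geometry, even though the embeddings themselves differ, and it grows when the text embeddings are arranged differently from the image embeddings. In practice such a term would be added to the usual cross-modal contrastive objective rather than used alone.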
Comprehensive experiments across various RSITR and TIReID datasets demonstrate the effectiveness of both the MLCE loss and SPDS. The study also explores the performance limits of SPDS and compares it with conventional teacher-student distillation methods. The findings show that the MLCE loss achieves state-of-the-art retrieval performance on several datasets, while SPDS strikes a favorable balance between accuracy and computational cost at test time.
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Vision and Language, [Experience] Multimedia Applications
Relevance To Conference: In this paper, we focus on multimodal retrieval for specific domains, such as remote sensing image-text retrieval and text-image person re-identification. To address the issues of weak modality-level distribution consistency and redundant data flow during testing in existing methods, we propose a novel framework, Accurate and Lightweight learning for specific domain Image-text Retrieval (AIR). This framework not only improves cross-modal retrieval performance in specific domains but also significantly reduces the computational load during testing. Empirical evidence demonstrates that this study has achieved state-of-the-art retrieval performance across multiple domain-specific retrieval datasets, realizing both accurate and lightweight search capabilities.
Supplementary Material: zip
Submission Number: 3213