Dr. CLIP: CLIP-Driven Universal Framework for Zero-Shot Sketch Image Retrieval

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: The field of Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is undergoing a paradigm shift, moving from specialized models designed for individual tasks to general retrieval models capable of handling various specialized scenarios. Inspired by the impressive generalization ability of the Contrastive Language-Image Pretraining (CLIP) model, we propose a CLIP-driven universal framework (Dr. CLIP) that leverages prompt learning to guide the synergy between CLIP and ZS-SBIR. Specifically, Dr. CLIP is a multi-branch network built on the CLIP image and text encoders, and it covers all four variants of the ZS-SBIR task (inter-category, intra-category, cross-dataset, and generalization). Moreover, we decompose the synergy into classification learning, metric learning, and ranking learning, and introduce three key components to enhance learning effectiveness: i) a forgetting-suppression mechanism prevents catastrophic forgetting and constrains the feature distribution of new categories in classification learning; ii) a domain-balanced loss addresses sample imbalance and establishes effective cross-domain correlations in metric learning; iii) a pair-relation strategy captures relevance and ranking relationships between instances in ranking learning. Finally, we reorganize and re-split three coarse-grained datasets and two fine-grained datasets to accommodate the training settings of the four ZS-SBIR tasks. Comparison experiments confirm that our method surpasses state-of-the-art (SOTA) methods by a significant margin (1.95%~19.14% mAP), highlighting its generality and superiority.
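
To make the metric-learning component concrete, below is a minimal, hypothetical sketch of a domain-balanced cross-domain objective over CLIP-style embeddings. It is not the authors' implementation: the symmetric InfoNCE form, the class-level soft targets, the 512-d embedding size, and the temperature value are all assumptions made for illustration, standing in for the paper's domain-balanced loss.

```python
# Minimal, hypothetical sketch (not the authors' code): a symmetric
# sketch<->photo contrastive loss whose two retrieval directions are
# averaged so that neither domain dominates the gradient.
import torch
import torch.nn.functional as F

def domain_balanced_contrastive_loss(sketch_emb, photo_emb, labels, temperature=0.07):
    """Symmetric InfoNCE over CLIP-style embeddings with class-level soft targets.

    sketch_emb, photo_emb: [N, D] embeddings from the two domains.
    labels: [N] seen-class labels; pairs sharing a label count as positives.
    """
    s = F.normalize(sketch_emb, dim=-1)
    p = F.normalize(photo_emb, dim=-1)
    logits = s @ p.t() / temperature  # cross-domain cosine similarities

    # Soft targets: uniform mass over all same-class pairs in the batch
    # (the diagonal pair is always a positive, so every row sum is >= 1).
    pos = labels.unsqueeze(1).eq(labels.unsqueeze(0)).float()
    targets = pos / pos.sum(dim=1, keepdim=True)

    loss_s2p = F.cross_entropy(logits, targets)      # sketches query photos
    loss_p2s = F.cross_entropy(logits.t(), targets)  # photos query sketches
    return 0.5 * (loss_s2p + loss_p2s)               # balance the two domains

# Toy usage with random 512-d (CLIP ViT-B-sized) embeddings.
if __name__ == "__main__":
    torch.manual_seed(0)
    sketches, photos = torch.randn(8, 512), torch.randn(8, 512)
    labels = torch.randint(0, 4, (8,))
    print(domain_balanced_contrastive_loss(sketches, photos, labels).item())
```

In Dr. CLIP the embeddings would come from the CLIP image encoder (with the text encoder steered by learned prompts); the random tensors here stand in only to keep the snippet self-contained.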
Primary Subject Area: [Engagement] Multimedia Search and Recommendation
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work contributes to multimodal processing by embracing the paradigm shift in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) and developing a more general retrieval model capable of handling various specialized scenarios. By integrating zero-shot learning techniques, sketch-based image retrieval algorithms, and the Contrastive Language-Image Pretraining (CLIP) model, this research enhances the understanding and utilization of multimodal data, addressing the challenges inherent in multimedia processing. The proposed Dr. CLIP framework leverages the complementary nature of different modalities, enabling more accurate and comprehensive analysis, interpretation, and retrieval of multimedia content. Beyond improving on existing ZS-SBIR methods, this work opens up new possibilities for applications such as "sketch-based searching of anything". The findings and insights gained from this research can advance the field of multimodal processing, fostering innovation and enabling more sophisticated, intelligent multimedia applications across domains.
Submission Number: 1568