Deep Residual Coupled Prompt Learning for Zero-Shot Sketch-Based Image Retrieval

Published: 2025, Last Modified: 23 Jan 2026, IEEE Trans. Big Data 2025, CC BY-SA 4.0
Abstract: Zero-shot sketch-based image retrieval (ZS-SBIR) aims to retrieve natural images that share semantics with freehand sketch queries in realistic zero-shot scenarios. Existing works achieve zero-shot semantic transfer through category word embeddings and rely on teacher-student networks to alleviate catastrophic forgetting in pre-trained models, aiming to retain rich discriminative features. However, category word embeddings lack flexibility, which limits retrieval performance in ZS-SBIR scenarios. In addition, the teacher network used to generate guidance signals introduces computational redundancy, since every mini-batch must be processed twice. To address these issues, we propose deep residual coupled prompt learning (DRCPL) for ZS-SBIR. Specifically, we leverage the CLIP text encoder to generate category classification weights, improving the flexibility and generality of zero-shot semantic transfer. To tune text and vision representations effectively, we introduce learnable prompts at the input while freezing the parameters of the CLIP encoder; this not only prevents catastrophic forgetting but also significantly reduces the model's computational cost. We further introduce a text-vision prompt coupling function that enforces coordinated consistency between the text and vision representations, ensuring that the two branches train collaboratively. Finally, we learn prompts independently at different early stages, gradually establishing cross-stage feature relationships to facilitate rich contextual learning. Comprehensive experimental results demonstrate that DRCPL achieves state-of-the-art performance on ZS-SBIR tasks.
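The core recipe described in the abstract (frozen CLIP encoders, learnable input prompts, a text-to-vision coupling function, and independent prompts per early stage) can be captured in a short sketch. The snippet below is a minimal illustration assuming a MaPLe-style linear coupling; all module names, dimensions, and the coupling design are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of coupled prompt learning over a frozen backbone.
# Names, dimensions, and the linear coupling are illustrative assumptions.
import torch
import torch.nn as nn

class CoupledPromptLearner(nn.Module):
    def __init__(self, n_prompts=4, txt_dim=512, vis_dim=768, depth=3):
        super().__init__()
        # One independent set of learnable text prompts per early stage,
        # so each stage learns its own context.
        self.text_prompts = nn.ParameterList(
            nn.Parameter(torch.randn(n_prompts, txt_dim) * 0.02)
            for _ in range(depth)
        )
        # Coupling functions project text prompts into the vision branch,
        # so the two branches train collaboratively.
        self.couple = nn.ModuleList(
            nn.Linear(txt_dim, vis_dim) for _ in range(depth)
        )

    def forward(self):
        # Vision prompts are derived from text prompts, enforcing
        # text-vision consistency rather than learning them separately.
        vis_prompts = [f(p) for f, p in zip(self.couple, self.text_prompts)]
        return list(self.text_prompts), vis_prompts

def freeze(encoder: nn.Module):
    # Freeze a pre-trained CLIP-like encoder; only the prompts receive
    # gradients, which avoids catastrophic forgetting without a teacher.
    for param in encoder.parameters():
        param.requires_grad_(False)

# Usage: only the prompt learner's parameters are optimized.
prompt_learner = CoupledPromptLearner()
text_p, vis_p = prompt_learner()  # per-stage text/vision prompt pairs
optimizer = torch.optim.AdamW(prompt_learner.parameters(), lr=1e-3)
```

Because gradients flow only through the small prompt modules, a single forward pass per mini-batch suffices, in contrast to teacher-student setups that process each batch through both networks.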