COP

Jiahao Zheng, Yu Tang, Yongcan Luo, Ning Chen, Dan Zeng, Dapeng Wu

Published: 15 Jan 2026, Last Modified: 14 Mar 2026. IEEE Transactions on Multimedia. CC BY-SA 4.0

Abstract: Zero-shot Sketch-based Image Retrieval (ZS-SBIR) is a challenging yet rewarding task, as it requires models to possess both brain-like zero-shot learning and cross-view alignment capabilities. Recent advances suggest that powerful pre-trained vision encoders, such as CLIP, offer a promising alternative for addressing the ZS-SBIR task. However, the problem of simultaneously evoking the zero-shot learning capability and the cross-view alignment capability of pre-trained vision encoders has barely been discussed. To this end, we propose the CrOss-view Attention Prompt (COP) framework, which is composed of an Attention Prompt module and a Cross-view Query module. Specifically, we formulate prompt construction as a retrieval problem by introducing a prompt pool and an attention mechanism, thereby constructing fine-grained attention prompts that enhance the zero-shot learning capability. Furthermore, to endow COP with cross-view alignment capabilities, we replace single-view queries with carefully designed cross-view queries, which can be smoothly inserted into the Attention Prompt module. The proposed COP is scenario-agnostic and supports vision encoders with diverse pre-training schemes. Comprehensive experiments show that COP achieves competitive performance in ZS-SBIR, Generalized ZS-SBIR, and Cross-data ZS-SBIR scenarios, whether it is built on an ImageNet pre-trained or a CLIP pre-trained vision encoder.
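The abstract's idea of treating prompt construction as retrieval from a prompt pool can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no architectural details, so the function names (`attention_prompt`, `cross_view_query`), the tensor shapes, the scaled dot-product scoring, and in particular the way the cross-view query is formed (here, simply the mean of L2-normalized sketch and photo features) are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_query(sketch_feat, photo_feat):
    # ASSUMPTION: the paper's cross-view query design is not described in the
    # abstract. This placeholder averages the L2-normalized features of the
    # two views so that the query depends on both sketch and photo.
    s = sketch_feat / np.linalg.norm(sketch_feat)
    p = photo_feat / np.linalg.norm(photo_feat)
    return 0.5 * (s + p)


def attention_prompt(query, pool_keys, pool_values):
    # Hypothetical shapes:
    #   query:       (d,)       query vector
    #   pool_keys:   (n, d)     one key per prompt in the pool
    #   pool_values: (n, p, d)  each pool entry holds p prompt tokens
    # Returns a (p, d) prompt built as an attention-weighted mix of the pool,
    # together with the (n,) attention weights.
    d = query.shape[-1]
    scores = pool_keys @ query / np.sqrt(d)
    weights = softmax(scores)
    prompt = np.tensordot(weights, pool_values, axes=1)
    return prompt, weights


rng = np.random.default_rng(0)
keys = rng.normal(size=(5, 8))       # pool of 5 prompts, feature dim 8
values = rng.normal(size=(5, 3, 8))  # 3 prompt tokens per pool entry
query = cross_view_query(rng.normal(size=8), rng.normal(size=8))
prompt, weights = attention_prompt(query, keys, values)
print(prompt.shape, weights.round(3))
```

In a setup like this, the resulting prompt tokens would typically be prepended to the input token sequence of a frozen vision encoder; the soft weighting over pool entries, rather than a hard top-k selection, is one way to obtain the finer granularity the abstract refers to.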

External IDs: doi:10.1109/tmm.2026.3654420