Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning

Man Liu, Huihui Bai, Feng Li, Chunjie Zhang, Yunchao Wei, Tat-Seng Chua, Yao Zhao

Published: 01 Jan 2025, Last Modified: 05 Nov 2025AAAI 2025EveryoneRevisionsBibTeXCC BY-SA 4.0

Abstract: Zero-shot learning (ZSL) endeavors to transfer knowledge from the seen categories to recognize unseen categories, which mostly relies on the semantic-visual interactions between image and attribute tokens. Recently, the prompt learning has emerged in ZSL and demonstrated significant potential as it allows the zero-shot transfer of diverse visual concepts to downstream tasks. However, current methods explore the fixed adaptation of the learnable prompt on the seen domains, which make them over-emphasize the primary visual features observed during training, limiting their generalization capabilities to the unseen domains. In this work, we propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment, enabling effective knowledge transfer for ZSL. AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding the semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision. It is further integrated with primary visual features to attend to semantic-related information for visual enhancement, thus strengthening transferable ability. Experimental results on three benchmarks show that our AENet outperforms existing state-of-the-art ZSL methods.