Abstract: In the vanilla zero-shot learning (ZSL) paradigm, category attributes are the key to generalizable knowledge transfer from seen to unseen classes. By contrast, the contrastive language-image pretraining (CLIP) model relies on category names to achieve a more general ZSL-like prediction. When vanilla ZSL meets general CLIP, however, most existing methods on both sides struggle to benefit from each other. In this brief, we resort to attribute prompt tuning (APT) to improve knowledge transfer from the pretrained CLIP model to the downstream ZSL framework and thereby obtain desirable feature representations. Our approach, termed attribute prompt alignment network (APAN), leverages APT for cross-network feature alignment (CFA). The two-branch APAN architecture lets us investigate the effect of CLIP on the vanilla ZSL task in the era of large models. Specifically, APT takes templates of class attribute descriptions as input to produce attribute prompts, which then guide the localization of visual regions in both of the two frozen feature extraction networks through a visual-semantic interaction attention. This enables APAN to progressively refine and align these cross-network features, resulting in generalizable feature representations that capture fine-grained attribute information. For CFA, we simply introduce a prediction alignment loss that constrains the predictions from the two cross-network visual features. Experimental results on three benchmark datasets demonstrate that APAN outperforms state-of-the-art methods by absorbing generalizable knowledge from CLIP models.
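The abstract does not give the form of the prediction alignment loss. As a rough illustration only, the sketch below assumes it is a symmetric KL divergence between the class distributions predicted by the two branches. The function names, the choice of symmetric KL, and the `eps` smoothing term are assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions; eps avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def prediction_alignment_loss(logits_a, logits_b):
    """Hypothetical alignment loss: symmetric KL between two branches.

    logits_a and logits_b stand in for the class scores produced from the
    two cross-network visual features (assumed names, not from the paper).
    """
    p = softmax(logits_a)
    q = softmax(logits_b)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


if __name__ == "__main__":
    # The loss is zero when both branches agree and grows as they diverge.
    print(prediction_alignment_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
    print(prediction_alignment_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0]))
```

Under this assumption, minimizing the loss pushes the two branches toward the same class distribution. Other choices, such as a one-directional KL or a squared error on the logits, would serve the same constraining role.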
External IDs: doi:10.1109/tnnls.2025.3598191