Abstract: Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks.
However, existing methods often overfit to seen data, leading to
significant performance degradation when generalizing to novel
classes or unseen domains. To address this limitation, we propose
DiSa, a Directional Saliency-Aware Prompt Learning framework
that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization
(CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize
semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization
strategy that aligns visual embeddings with class-wise prototype
features in a directional manner to prioritize consistency in feature
orientation over strict proximity. This approach ensures robust
generalization by leveraging stable prototype directions derived
from class-mean statistics. Extensive evaluations on 11 diverse image classification benchmarks demonstrate that DiSa consistently
outperforms state-of-the-art prompt learning methods across various settings, including base-to-novel generalization, cross-dataset
transfer, domain generalization, and few-shot learning.
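
To make the saliency-aware masking idea concrete, the following is a minimal sketch of one plausible realization: patch tokens with low saliency are zeroed out so the encoder attends to semantically critical regions. The use of CLS-attention weights as the saliency signal and the keep_ratio parameter are illustrative assumptions, not details taken from the paper.

```python
import torch

def saliency_mask(patch_tokens: torch.Tensor,
                  saliency: torch.Tensor,
                  keep_ratio: float = 0.5) -> torch.Tensor:
    """Zero out the least-salient patch tokens.

    patch_tokens: (B, N, D) patch embeddings from the image encoder.
    saliency:     (B, N) per-patch saliency scores (assumed here to be
                  CLS-attention weights from the frozen encoder).
    keep_ratio:   fraction of patches to keep (illustrative choice).
    """
    B, N, _ = patch_tokens.shape
    k = max(1, int(keep_ratio * N))
    # Indices of the top-k most salient patches per image.
    topk = saliency.topk(k, dim=1).indices          # (B, k)
    mask = torch.zeros(B, N, device=patch_tokens.device)
    mask.scatter_(1, topk, 1.0)                     # 1 = keep, 0 = mask
    return patch_tokens * mask.unsqueeze(-1)
```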
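Similarly, the directional regularization can be sketched as a cosine-based penalty that matches the direction of each visual embedding to its class prototype rather than its exact position in feature space. How the prototypes are built (here, assumed to be class-mean features from the frozen encoder) and the loss form are assumptions consistent with, but not copied from, the abstract's description.

```python
import torch
import torch.nn.functional as F

def directional_loss(img_feats: torch.Tensor,
                     labels: torch.Tensor,
                     prototypes: torch.Tensor) -> torch.Tensor:
    """Align each image feature with its class prototype in direction.

    img_feats:  (B, D) prompted visual embeddings.
    labels:     (B,)   ground-truth class indices.
    prototypes: (C, D) class-mean features (assumed construction:
                per-class averages of frozen-encoder embeddings).
    """
    protos = prototypes[labels]                         # (B, D)
    cos = F.cosine_similarity(img_feats, protos, dim=-1)
    # Penalize deviation in orientation only; the loss is 0 when the
    # feature points in the same direction as its prototype, regardless
    # of magnitude or Euclidean distance.
    return (1.0 - cos).mean()
```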