Highlights
• Our model design combines the advantages of both prompt learning and adapter tuning (a minimal sketch follows this list).
• It aligns CLIP's visual and textual encoders with specific datasets using few-shot images.
• The model further enhances CLIP's few-shot capability, achieving superior results.
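To make the first highlight concrete, below is a minimal PyTorch sketch of how learnable prompt vectors and lightweight adapters can be combined on top of a frozen CLIP backbone. This is an illustrative assumption, not the paper's released code: `encode_image` exists in OpenAI's CLIP, but `encode_text_from_embeds` is a hypothetical hook standing in for a custom text-encoder forward pass that accepts prompt embeddings (as in CoOp-style methods), and all hyperparameters are placeholders.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project,
    blended with the original feature via a residual ratio."""

    def __init__(self, dim: int, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
        )
        self.ratio = ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ratio * self.fc(x) + (1 - self.ratio) * x


class PromptedAdapterCLIP(nn.Module):
    """Hypothetical combination of prompt learning and adapter tuning.

    The CLIP backbone stays frozen; the only trainable parameters are
    (a) learnable prompt context vectors prepended to per-class token
    embeddings and (b) small adapters on both encoder outputs."""

    def __init__(self, clip_model, n_ctx: int = 16, embed_dim: int = 512):
        super().__init__()
        self.clip = clip_model.eval()
        for p in self.clip.parameters():  # freeze the CLIP backbone
            p.requires_grad_(False)
        # Learnable prompt context, shared across all classes.
        self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)
        self.image_adapter = Adapter(embed_dim)
        self.text_adapter = Adapter(embed_dim)

    def forward(self, images: torch.Tensor,
                class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (n_classes, n_tokens, embed_dim) embeddings
        # of the class names, precomputed from CLIP's token embedding table.
        img_feat = self.image_adapter(self.clip.encode_image(images))
        n_classes = class_token_embeds.size(0)
        prompts = torch.cat(
            [self.ctx.unsqueeze(0).expand(n_classes, -1, -1),
             class_token_embeds],
            dim=1,
        )
        # Hypothetical hook: run prompt embeddings through the frozen
        # text transformer (requires a custom forward, as in CoOp).
        txt_feat = self.text_adapter(
            self.clip.encode_text_from_embeds(prompts))
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        return img_feat @ txt_feat.t()  # cosine-similarity logits
```

In a few-shot setting, only `ctx` and the two adapters would be optimized with a cross-entropy loss over these logits, which is what lets both the prompts and the adapted features align the frozen encoders to the target dataset.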