Abstract: Existing prompting strategies for adapting pretrained vision-language models to the downstream task of fine-grained attribute classification learn visual variance in a class-specific manner. We present DisPoL, a method for learning disentangled representations that improves the transferability and performance of continuous prompts for downstream classification tasks. Our method decomposes a prompt into separate sub-prompts, then performs late fusion of the corresponding output embeddings in a novel manner. We combine the fixed embedding of a static, context-constraining object sub-prompt and the tunable embedding of a soft sub-prompt for a task-specific attribute using self-attention. By avoiding joint learning of these tokens, the resulting disentangled prompt embeddings are more transferable to unseen objects. We also demonstrate how to use hand-crafted templates to initialize the task-specific soft prompt, improving training efficiency. Through extensive experiments, we show that DisPoL exceeds the performance of existing methods in few-shot settings and highlight its contribution as a parameter-efficient fine-tuning method.
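To make the late-fusion step concrete, the following is a minimal sketch of how a fixed object sub-prompt embedding and a tunable attribute sub-prompt embedding could be combined with single-head self-attention. This is an illustrative assumption, not the paper's implementation: the function name, the identity query/key/value projections, and the mean pooling at the end are all hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fuse(obj_emb, attr_emb):
    """Hypothetical late fusion: attend over the frozen object embedding
    and the tunable attribute embedding, then pool into one prompt vector.
    Q = K = V = the stacked embeddings (identity projections, for brevity)."""
    X = np.stack([obj_emb, attr_emb])      # (2, d): two sub-prompt embeddings
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)          # (2, 2) scaled dot-product logits
    A = softmax(scores, axis=-1)           # attention weights per sub-prompt
    fused = A @ X                          # (2, d) attended embeddings
    return fused.mean(axis=0)              # pooled disentangled prompt

rng = np.random.default_rng(0)
obj_emb = rng.standard_normal(8)    # fixed embedding of the static object sub-prompt
attr_emb = rng.standard_normal(8)   # tunable embedding of the soft attribute sub-prompt
prompt = self_attention_fuse(obj_emb, attr_emb)
```

In training, only `attr_emb` (the soft sub-prompt) would receive gradient updates, while `obj_emb` stays frozen, which is what keeps the two representations disentangled.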