Abstract: Prompt tuning on task-specific data is a common approach for adapting vision-language models to downstream image recognition tasks. Despite recent advancements in prompt tuning, generalizing to heterogeneous images that vary widely in style, format, and source remains a significant challenge. To this end, we propose a novel method, Self-generated Cross-modal Prompt tuning (SCP), which generates pseudo prompts from the frozen pretrained model's knowledge and uses them to guide training in both the initialization and optimization stages. Consequently, the model can be trained on available datasets while generalizing effectively to heterogeneous image data spanning a wide spectrum of textual classes and visual characteristics. Extensive experiments on four benchmarks show that SCP significantly outperforms well-known baselines in generalization across a broad range of downstream tasks. Notably, SCP improves Cross-Dataset and Domain-Shift generalization by at least 3.63% and 11.71%, respectively. Our code is available at https://github.com/Ghosttimber/Academic.
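The abstract does not specify SCP's exact architecture, but the core idea it describes (pseudo prompts derived from a frozen model guiding both initialization and optimization) can be sketched in a CLIP-style setup. The sketch below is illustrative only: the class names, the stand-in frozen encoder, and the regularization weight `lambda_reg` are all assumptions, not the paper's actual API.

```python
# Minimal sketch, assuming a CLIP-style prompt-tuning setup.
# All names here are hypothetical illustrations, not the paper's method.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrozenTextEncoder(nn.Module):
    """Stand-in for a pretrained, frozen VLM text encoder (e.g., CLIP's)."""
    def __init__(self, vocab_size=49408, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        for p in self.parameters():
            p.requires_grad_(False)  # frozen knowledge: no gradient updates

    def forward(self, token_ids):  # (B, L) -> (B, L, D)
        return self.embed(token_ids)

class SelfGeneratedPrompts(nn.Module):
    """Learnable context prompts that are (1) initialized from pseudo prompts
    produced by the frozen encoder and (2) regularized toward them during
    optimization, mirroring the two stages mentioned in the abstract."""
    def __init__(self, frozen_encoder, template_ids, n_ctx=16):
        super().__init__()
        with torch.no_grad():  # initialization stage: query frozen model
            pseudo = frozen_encoder(template_ids)[0, :n_ctx]  # (n_ctx, D)
        self.register_buffer("pseudo_prompts", pseudo)
        self.ctx = nn.Parameter(pseudo.clone())  # trainable copy

    def regularization(self):
        # optimization stage: keep learned prompts close to pseudo prompts
        return F.mse_loss(self.ctx, self.pseudo_prompts)

# Usage sketch (lambda_reg = 0.1 is an assumed hyperparameter):
encoder = FrozenTextEncoder()
template = torch.randint(0, 49408, (1, 16))  # e.g., tokenized "a photo of a ..."
prompts = SelfGeneratedPrompts(encoder, template)
task_loss = torch.tensor(0.0)  # placeholder for the downstream classification loss
loss = task_loss + 0.1 * prompts.regularization()
loss.backward()  # gradients flow only into the learnable prompts
```

Under these assumptions, the frozen encoder acts as a fixed teacher: it supplies the starting point for the prompts and anchors them during training, which is one plausible reading of how SCP avoids overfitting to the training distribution.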