Keywords: Zero-Shot Generative Model Adaptation, Multimodal Representation Space, Transfer Learning, Prompt Learning
Abstract: Zero-shot generative model adaptation (ZSGM) aims to adapt pre-trained generative models using only textual descriptions. ZSGM is particularly valuable for data-scarce target domains, such as rare concepts or artistic styles, where obtaining training samples is challenging. Central to all existing ZSGM methods is the foundational assumption that image-text offsets in CLIP's multimodal representation space are well aligned to guide adaptation.
**In this work**, we present two main contributions. First, we question this foundational assumption by conducting the first comprehensive empirical analysis of image-text offset alignment in CLIP space within the ZSGM context. Our findings reveal not only noticeable misalignment but also a meaningful positive correlation between image-text offset misalignment and concept distance across six large datasets and four multimodal spaces. Second, leveraging this discovery, we propose Adaptation with Iterative Refinement (AIR), the first method focused on improving sample quality for ZSGM.
Our method iteratively refines the text offsets and mitigates image-text offset misalignment using anchor sampling and a novel prompt learning approach. Comprehensive experiments across **32** experimental setups, including qualitative, quantitative, and user studies, consistently show that AIR achieves state-of-the-art performance.
**Code and additional experiments are available in the supplementary material.**
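As a concrete illustration of the image-text offset alignment the abstract refers to, the sketch below (not the authors' code) computes an image offset and a text offset between a source and a target concept in CLIP space and measures their directional agreement via cosine similarity. It assumes the Hugging Face `transformers` library, the `openai/clip-vit-base-patch32` checkpoint, and placeholder image files; the prompts and file names are purely illustrative.

```python
# Minimal sketch: measuring image-text offset (mis)alignment in CLIP space.
# Assumptions: transformers + PIL installed; "photo_dog.jpg" / "sketch_dog.jpg"
# are placeholder images for a hypothetical source and target domain.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(texts, images):
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Normalize embeddings so offsets are directions on the unit hypersphere.
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return img, txt

# Source (index 0) and target (index 1) concepts in both modalities.
images = [Image.open("photo_dog.jpg"), Image.open("sketch_dog.jpg")]
texts = ["a photo of a dog", "a sketch of a dog"]
img_emb, txt_emb = embed(texts, images)

# Offsets: direction from the source concept to the target concept.
image_offset = img_emb[1] - img_emb[0]
text_offset = txt_emb[1] - txt_emb[0]

# If the offsets were perfectly aligned, their cosine similarity would be 1.
cos = torch.nn.functional.cosine_similarity(image_offset, text_offset, dim=0)
print(f"offset cosine similarity: {cos.item():.3f}  (misalignment: {1 - cos.item():.3f})")
```

In this reading, existing ZSGM methods implicitly assume the printed cosine similarity is close to 1; the paper's empirical claim is that it often is not, and that the gap grows with concept distance.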
Primary Area: generative models
Submission Number: 990