Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

TMLR Paper 2556 Authors

20 Apr 2024 (modified: 01 Jul 2024) · Under review for TMLR · License: CC BY-SA 4.0
Abstract: Diffusion models such as Stable Diffusion have shown impressive performance on text-to-image generation. Since text-to-image generation often requires models to render visual concepts with the fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model for image-text matching via a new attention-based prompt learning strategy. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
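To make the core idea concrete: the abstract describes scoring image-text compatibility from cross-attention inside the diffusion U-Net. Below is a minimal, hypothetical sketch of that idea only; the function names and the max-over-tokens / mean-over-patches aggregation are my assumptions for illustration, not the authors' actual Discffusion implementation or prompt-learning procedure.

```python
import torch
import torch.nn.functional as F

def cross_attention_score(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Reduce cross-attention between image patches (queries) and text
    tokens (keys) to a scalar image-text matching score.

    image_feats: (num_patches, d) U-Net latent features for the image
    text_feats:  (num_tokens, d)  text-encoder embeddings for the caption
    """
    # Scaled dot-product logits, as in standard cross-attention layers.
    logits = image_feats @ text_feats.T / (image_feats.shape[-1] ** 0.5)
    attn = F.softmax(logits / temperature, dim=-1)  # (num_patches, num_tokens)
    # One plausible aggregation (assumption): each patch's strongest
    # token alignment, averaged over all patches.
    return attn.max(dim=-1).values.mean()

def match(image_feats: torch.Tensor,
          caption_feats: list[torch.Tensor]) -> int:
    """Return the index of the caption with the highest matching score."""
    scores = torch.stack([cross_attention_score(image_feats, t)
                          for t in caption_feats])
    return int(scores.argmax())
```

In the paper itself these attention maps come from a pre-trained Stable Diffusion model and the matching head is adapted with attention-based prompt learning; the sketch above only illustrates how a cross-attention map can be collapsed into a discriminative score.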
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have made the following updates to the paper:
- Created a new Figure 2 and updated the figure captions.
- Rewrote the method section, including the algorithm block.
- Reorganized the experiments section and added additional experiments.
- Included additional experimental settings.
- Revised the conclusion section.

We have also thoroughly proofread the entire paper, improved the writing, provided additional clarifications, and included more details. All modifications are highlighted in blue.
Assigned Action Editor: ~Yingzhen_Li1
Submission Number: 2556