Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners

TMLR Paper 2556 Authors

20 Apr 2024 (modified: 01 Jul 2024) · Under review for TMLR · License: CC BY-SA 4.0
Abstract: Diffusion models such as Stable Diffusion have shown impressive performance on text-to-image generation. Since text-to-image generation often requires models to render visual concepts with the fine-grained details and attributes specified in text prompts, can we leverage the powerful representations learned by pre-trained diffusion models for discriminative tasks such as image-text matching? To answer this question, we propose a novel approach, Discriminative Stable Diffusion (Discffusion), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners. Our approach uses the cross-attention scores of a Stable Diffusion model to capture the mutual influence between visual and textual information, and fine-tunes the model for image-text matching via a new attention-based prompt learning strategy. By comparing Discffusion with state-of-the-art methods on several benchmark datasets, we demonstrate the potential of pre-trained diffusion models for discriminative tasks, with superior results on few-shot image-text matching.
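To make the core idea concrete: the abstract describes scoring image-text compatibility from cross-attention inside the diffusion U-Net. Below is a minimal, hypothetical sketch of that idea only; the function names and the max-over-tokens / mean-over-patches aggregation are my assumptions for illustration, not the authors' actual Discffusion implementation or prompt-learning procedure.

```python
import torch
import torch.nn.functional as F

def cross_attention_score(image_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """Reduce cross-attention between image patches (queries) and text
    tokens (keys) to a scalar image-text matching score.

    image_feats: (num_patches, d) U-Net latent features for the image
    text_feats:  (num_tokens, d)  text-encoder embeddings for the caption
    """
    # Scaled dot-product logits, as in standard cross-attention layers.
    logits = image_feats @ text_feats.T / (image_feats.shape[-1] ** 0.5)
    attn = F.softmax(logits / temperature, dim=-1)  # (num_patches, num_tokens)
    # One plausible aggregation (assumption): each patch's strongest
    # token alignment, averaged over all patches.
    return attn.max(dim=-1).values.mean()

def match(image_feats: torch.Tensor,
          caption_feats: list[torch.Tensor]) -> int:
    """Return the index of the caption with the highest matching score."""
    scores = torch.stack([cross_attention_score(image_feats, t)
                          for t in caption_feats])
    return int(scores.argmax())
```

In the paper itself these attention maps come from a pre-trained Stable Diffusion model and the matching head is adapted with attention-based prompt learning; the sketch above only illustrates how a cross-attention map can be collapsed into a discriminative score.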
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have made the following updates to the paper:
- Created a new Figure 2 and updated the figure captions.
- Rewrote the method section, including the algorithm block.
- Reorganized the experiments section and added additional experiments.
- Included additional experimental settings.
- Revised the conclusion section.

We have also thoroughly proofread the entire paper, improved the writing, provided additional clarifications, and included more details. All modifications are highlighted in blue.
Assigned Action Editor: ~Yingzhen_Li1
Submission Number: 2556