ContraDiff: Unifying Training Process of Generative and Discriminative Vision Tasks in One Diffusion Model
Abstract: Besides their unprecedented ability in image generation, text-to-image diffusion models also provide powerful intermediate representations that support various discriminative vision tasks. However, efficiently adapting these models to handle both generative and discriminative tasks remains largely unexplored. While some unified frameworks have been proposed to reduce training-pipeline overhead, they often rely on computationally expensive pretraining and lack flexibility in adaptation. In this paper, we propose ContraDiff, a novel framework that efficiently leverages a pretrained diffusion model for both generative and discriminative tasks. Our approach centers on unified training and parameter-efficient optimization: it combines a reconstruction loss and a contrastive loss on images with varying noise levels to effectively balance generative and contrastive training. In addition, we apply LoRA to a pretrained Stable Diffusion model, significantly reducing training time without compromising performance. Experiments show that ContraDiff excels in both generative and discriminative vision tasks, achieving 80.1\% accuracy on ImageNet-1K classification and an FID of 5.56 for ImageNet 256$\times$256 unconditional image generation while requiring significantly fewer trainable parameters. This efficiency reduces computational cost and enhances the model's adaptability across a range of vision tasks. The code will be released publicly upon acceptance.
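A minimal sketch of the kind of combined objective the abstract describes, i.e., a denoising reconstruction loss plus a contrastive (InfoNCE-style) loss computed on features of noised images. This is not the authors' released code; the backbone interface, the feature head, the noise schedule, and all hyperparameter names here are illustrative assumptions.

```python
# Hypothetical sketch of a joint reconstruction + contrastive training loss on
# noised images. `backbone` is assumed to return (noise prediction, features);
# `feature_head` projects features for the contrastive term. Both are assumptions.
import torch
import torch.nn.functional as F

def combined_loss(backbone, feature_head, x0, x0_aug,
                  num_timesteps=1000, temperature=0.1):
    b = x0.size(0)
    # Sample per-example noise levels (timesteps) and noise both views with a
    # toy cosine-style schedule (placeholder, not the paper's schedule).
    t = torch.randint(0, num_timesteps, (b,), device=x0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps) ** 2
    a = alpha_bar.view(b, 1, 1, 1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps
    x_t_aug = a.sqrt() * x0_aug + (1 - a).sqrt() * torch.randn_like(x0_aug)

    # Reconstruction (denoising) loss: predict the added noise.
    eps_pred, feats = backbone(x_t, t)
    loss_rec = F.mse_loss(eps_pred, eps)

    # Contrastive loss: noised views of the same image are positives,
    # other images in the batch are negatives (InfoNCE over the batch).
    _, feats_aug = backbone(x_t_aug, t)
    z1 = F.normalize(feature_head(feats), dim=-1)
    z2 = F.normalize(feature_head(feats_aug), dim=-1)
    logits = z1 @ z2.t() / temperature
    labels = torch.arange(b, device=x0.device)
    loss_con = F.cross_entropy(logits, labels)

    return loss_rec + loss_con
```

In the parameter-efficient setting the abstract mentions, only LoRA adapters injected into the pretrained Stable Diffusion backbone (plus the small feature head) would receive gradients from this loss, rather than the full set of backbone weights.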
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gabriel_Loaiza-Ganem1
Submission Number: 6793