Abstract: Current state-of-the-art diffusion models employ U-Net architectures containing convolutional and (qkv) self-attention layers. The U-Net processes images while being conditioned on the time embedding input for each sampling step and on the class or caption embedding input corresponding to the desired conditional generation. This conditioning is applied as scale-and-shift operations on the convolutional layers but does not directly affect the attention layers. While these standard architectural choices are certainly effective, not conditioning the attention layers appears arbitrary and potentially suboptimal. In this work, we show that simply adding LoRA conditioning to the attention layers, without changing or tuning the other parts of the U-Net architecture, improves image generation quality. For example, a drop-in addition of LoRA conditioning to the EDM diffusion model yields FID scores of 1.91/1.75 for unconditional and class-conditional CIFAR-10 generation, improving upon the baseline of 1.97/1.79.
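As a rough illustration of the idea (a minimal sketch, not the authors' exact implementation), one way to condition an attention projection with LoRA is to add a low-rank branch whose contribution is modulated by the time/class embedding. All names here (LoRAConditionedLinear, gate, the shapes) are hypothetical:

```python
import torch
import torch.nn as nn

class LoRAConditionedLinear(nn.Module):
    """Sketch: a qkv projection augmented with a LoRA branch whose
    scale is produced from the conditioning (time/class) embedding."""

    def __init__(self, dim, emb_dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)                    # original qkv projection
        self.lora_down = nn.Linear(dim, rank, bias=False)  # A: dim -> rank
        self.lora_up = nn.Linear(rank, dim, bias=False)    # B: rank -> dim
        nn.init.zeros_(self.lora_up.weight)                # LoRA branch starts at zero
        self.gate = nn.Linear(emb_dim, 1)                  # embedding -> per-sample scale

    def forward(self, x, emb):
        # x: (batch, tokens, dim); emb: (batch, emb_dim)
        scale = self.gate(emb).unsqueeze(1)                # (batch, 1, 1), broadcast over tokens
        return self.base(x) + scale * self.lora_up(self.lora_down(x))

# Usage (shapes are illustrative):
layer = LoRAConditionedLinear(dim=64, emb_dim=32, rank=4)
x = torch.randn(2, 16, 64)   # (batch, tokens, channels)
emb = torch.randn(2, 32)     # time/class embedding
out = layer(x, emb)          # (2, 16, 64)
```

Initializing the up-projection to zero keeps the layer's output identical to the base projection at the start, so the conditioning branch is a drop-in addition, consistent with the abstract's claim that the rest of the U-Net is unchanged.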
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - Modified the format to the camera-ready version
- Fixed minor citation style issues
- Minor improvements to the FFHQ results
Video: https://youtube.com/watch?v=z-1LE2aNHak
Code: https://github.com/lthilnklover/diffusion_lora
Assigned Action Editor: ~Jakub_Mikolaj_Tomczak1
Submission Number: 2628