On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

Published: 25 Sept 2024, Last Modified: 06 Nov 2024, Venue: NeurIPS 2024 poster, License: CC BY 4.0
Keywords: Generative Models, Generative Modeling, Diffusion, Latent diffusion, Computer vision, text-to-image diffusion
Abstract: Large-scale training of latent diffusion models (LDMs) has enabled unprecedented quality in image generation. However, large-scale end-to-end training of these models is computationally costly, and hence most research focuses either on finetuning pretrained models or on experiments at smaller scales. In this work we aim to improve the training efficiency and performance of LDMs with the goal of scaling to larger datasets and higher resolutions. We focus our study on two points that are critical for good performance and efficient training: (i) the mechanisms used for semantic-level (e.g., a text prompt or class name) and low-level (crop size, random flip, etc.) conditioning of the model, and (ii) pre-training strategies to transfer representations learned on smaller and lower-resolution datasets to larger ones. The main contributions of our work are the following: we present a systematic experimental study of these points; we propose a novel conditioning mechanism that disentangles semantic and low-level conditioning; and we obtain state-of-the-art performance on CC12M for text-to-image generation at 512 resolution.
Primary Area: Diffusion based models
Flagged For Ethics Review: true
Submission Number: 20774