Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy method for stable training of large-scale, high-resolution models. without the needs for cascaded super-resolution components.The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high resolution end-to-end models, while preserving the integrity of the pre-trained representation,stabilizing training, and reducing the need for large high-resolution datasets. This enables a single stage model capable of generating high-resolution images without the need of a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes.Vermeer, our full pipeline model trained with internal datasets to produce 1024×1024 images, without cascades, is preferred by 44.0% vs. 21.4% human evaluators over SDXL.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=q5vK4MGPIK
Changes Since Last Submission: Removed teaser image from front page, as requested, so that the abstract now is immediately below the title.
Supplementary Material: pdf
Assigned Action Editor: ~Ran_He1
Submission Number: 2683
Loading