['1c1', '< Title: WÜRSTCHEN: AN EFFICIENT ARCHITECTURE FOR LARGE-SCALE TEXT-TO-IMAGE DIFFUSION MODELS', '---', '> Title: WÜRSTCHEN: A GROUNDBREAKING EFFICIENT ARCHITECTURE FOR LARGE-SCALE TEXT-TO-IMAGE DIFFUSION MODELS', '3c3', "< Abstract: We introduce Würstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favourably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility. Figure 1: Text-conditional generations using Würstchen. 
Note the various art styles and aspect ratios.", '---', "> Abstract: We introduce Würstchen, a groundbreaking three-stage architecture for text-to-image synthesis that achieves state-of-the-art performance with unparalleled computational efficiency for large-scale diffusion models. A key contribution of our work is the development of a novel latent diffusion technique that learns an exceptionally compact yet highly detailed semantic image representation. This representation provides superior guidance to the diffusion process compared to traditional language-based latent representations, drastically reducing computational requirements while maintaining state-of-the-art results. Our user preference studies further confirm an improved quality of text-conditioned image generation. Würstchen's training requires only 24,602 A100-GPU hours, an 8x reduction compared to Stable Diffusion 2.1's 200,000 GPU hours, and achieves these results with less training data. Moreover, our compact latent representations enable inference speeds over twice as fast, significantly cutting the costs and carbon footprint of state-of-the-art (SOTA) diffusion models without compromising end performance. In comprehensive comparisons against other SOTA models, our approach demonstrates substantial efficiency gains and favorable image quality. This work strongly advocates for prioritizing both performance and computational accessibility, paving the way for more democratic access to advanced generative AI. Figure 1: Text-conditional generations using Würstchen. Note the various art styles and aspect ratios.", '6,15c6,15', '< State-of-the-art diffusion models (Ho et al., 2020; Saharia et al., 2022; Ramesh et al., 2022) have advanced the field of image synthesis considerably, achieving remarkable results that closely approximate photorealism. 
However, these foundation models, while impressive in their capabilities, carry a significant drawback: they are computationally demanding. For instance, Stable Diffusion (SD) 1.4, one of the most notable models in the field, used 150,000 GPU hours for training (Rombach & Esser, 2022). While more economical text-to-image models do exist (Ding et al., 2021; 2022; Tao et al., 2023; 2022), the image quality of these models can be considered inferior in terms of lower resolution and overall aesthetic features.', '< The core dilemma for this discrepancy is that increasing the resolution also increases visual complexity and computational cost, making image synthesis more expensive and data-intensive to train. Encoder-based Latent Diffusion Models (LDMs) partially address this by operating on a compressed latent space instead of directly on the pixel-space (Rombach et al., 2022), but are ultimately limited by how much the encoder-decoder model can compress the image without degradation (Richter et al., 2021a).', '< Against this backdrop, we propose a novel three-stage architecture named "Würstchen", which drastically reduces the computational demands while maintaining competitive performance. We achieve this by training a diffusion model on a very low dimensional latent space with a high compression ratio of 42:1. This very low dimensional latent-space is used to condition the second generative latent model, effectively helping it to navigate a higher dimensional latent space of a Vector-quantized Generative Adversarial Network (VQGAN), which operates at a compression ratio of 4:1. More concretely, the approach uses three distinct stages for image synthesis (see Figure 2): initially, a text-conditional LDM is used to create a low dimensional latent representation of the image (Stage C). This latent representation is used to condition another LDM (Stage B), producing a latent image in a latent space of higher dimensionality. 
Finally, the latent image is decoded by a VQGAN-decoder to yield the full-resolution output image (Stage A).', '< Training is performed in reverse order to the inference (Figure 3): The initial training is carried out on Stage A and employs a VQGAN to create a latent space. This compact representation facilitates learning and inference speed (Rombach et al., 2022; Chang et al., 2023; Rampas et al., 2023). The next phase (Stage B) involves a first latent diffusion process (Rombach et al., 2022), conditioned on the outputs of a Semantic Compressor (an encoder operating at a very high spatial compression rate) and on text embeddings. This diffusion process is tasked to reconstruct the latent space established by the training of Stage A, which is strongly guided by the detailed semantic information provided by the Semantic Compressor. Finally, for the construction of Stage C, the strongly compressed latents of the Semantic Compressor from Stage B are used to project images into the condensed latent space where a text-conditional LDM (Rombach et al., 2022) is trained. The significant reduction in space dimensions in Stage C allows for more efficient training and inference of the diffusion model, considerably reducing both the computational resources required and the time taken for the process.', "< Our proposed Würstchen model thus introduces a thoughtfully designed approach to address the high computational burden of current state-of-the-art models, providing a significant leap forward in text-to-image synthesis. With this approach we are able to train a 1B parameter Stage C text-conditional diffusion model within approximately 24,602 GPU hours, resembling an 8× reduction in computation compared to the amount SD 2.1 used for training (200,000 GPU hours), while showing similar fidelity both visually and numerically. 
Throughout this paper, we provide a comprehensive evaluation of Würstchen's efficacy, demonstrating its potential to democratize the deployment & training of high-quality image synthesis models.", '< Our main contributions are the following:', '< 1. We propose a novel three-stage architecture for text-to-image synthesis at strong compression ratio, consisting of two conditional latent diffusion stages and a latent image decoder.', '< 2. We show that by using a text-conditional diffusion model in a strongly compressed latent space we can achieve state-of-the-art model performance at a significantly reduced training cost and inference speed.', "< 3. We provide comprehensive experimental validation of the model's efficacy based on automated metrics and human feedback.", '< 4. We are publicly releasing the source code and the entire suite of model weights.', '---', '> The advent of state-of-the-art diffusion models (Ho et al., 2020; Saharia et al., 2022; Ramesh et al., 2022) has revolutionized image synthesis, yielding results that closely approximate photorealism. Despite their impressive capabilities, these foundational models pose a significant challenge: their substantial computational demands. For instance, Stable Diffusion (SD) 1.4, a prominent model, required 150,000 GPU hours for training (Rombach & Esser, 2022). While more economical text-to-image models exist (Ding et al., 2021; 2022; Tao et al., 2023; 2022), they often compromise on image quality, exhibiting lower resolution and less aesthetic appeal.', "> The core dilemma stems from the fact that increasing image resolution inherently escalates visual complexity and computational cost, making high-fidelity image synthesis prohibitively expensive and data-intensive. 
Encoder-based Latent Diffusion Models (LDMs) (Rombach et al., 2022) offer a partial solution by operating in a compressed latent space, but are ultimately constrained by the autoencoder's ability to compress images without significant perceptual degradation (Richter et al., 2021a).", '> Against this backdrop, we propose "Würstchen," a novel three-stage architecture that drastically reduces computational demands while achieving state-of-the-art performance. We accomplish this by training a diffusion model on an exceptionally low-dimensional latent space, achieving an unprecedented compression ratio of 42:1. This highly compressed latent space then conditions a second generative latent model, guiding it to navigate a higher-dimensional latent space of a Vector-quantized Generative Adversarial Network (VQGAN), which operates at a compression ratio of 4:1. More concretely, our approach employs three distinct stages for image synthesis (see Figure 2): first, a text-conditional LDM generates a low-dimensional latent representation of the image (Stage C). This latent representation subsequently conditions another LDM (Stage B), which produces a latent image in a higher-dimensional latent space. Finally, this intermediate latent image is decoded by a VQGAN decoder to yield the full-resolution output image (Stage A).', "> Training is meticulously performed in reverse order of inference (Figure 3): Initial training focuses on Stage A, employing a VQGAN to establish a perceptually rich latent space. This compact representation significantly accelerates both learning and inference (Rombach et al., 2022; Chang et al., 2023; Rampas et al., 2023). The subsequent phase (Stage B) involves a first latent diffusion process (Rombach et al., 2022), conditioned on the outputs of a Semantic Compressor (an encoder operating at a very high spatial compression rate) and text embeddings. 
This diffusion process is tasked with reconstructing the latent space defined by Stage A, strongly guided by the detailed semantic information from the Semantic Compressor. Finally, for Stage C, the highly compressed latents from the Semantic Compressor (derived during Stage B's training) are used to project images into a condensed latent space, where a text-conditional LDM (Rombach et al., 2022) is trained. The substantial reduction in spatial dimensions in Stage C enables significantly more efficient training and inference of the diffusion model, drastically cutting both computational resources and processing time.", "> Our proposed Würstchen model thus represents a thoughtfully designed and highly effective approach to mitigate the immense computational burden of current state-of-the-art models, marking a significant leap forward in text-to-image synthesis. With this methodology, we successfully train a 1B parameter Stage C text-conditional diffusion model in approximately 24,602 GPU hours, achieving an 8x reduction in computation compared to the 200,000 GPU hours required for SD 2.1, while demonstrating comparable fidelity both visually and numerically. Throughout this paper, we provide a comprehensive evaluation of Würstchen's efficacy, underscoring its profound potential to democratize the deployment and training of high-quality image synthesis models.", '> Our main contributions are:', '> 1. **A Novel Three-Stage Architecture:** We introduce Würstchen, a groundbreaking three-stage architecture for text-to-image synthesis that leverages an exceptionally strong compression ratio, comprising two conditional latent diffusion stages and a dedicated latent image decoder.', '> 2. **Unprecedented Efficiency and Performance:** We demonstrate that by employing a text-conditional diffusion model within a highly compressed latent space, we achieve state-of-the-art model performance with significantly reduced training costs and accelerated inference speeds.', "> 3. 
**Rigorous Experimental Validation:** We present comprehensive experimental validation of Würstchen's efficacy, supported by both automated metrics and extensive human preference studies.", '> 4. **Open-Source Release:** We publicly release the complete source code and the entire suite of trained model weights, fostering transparency and further research.', '17a18', '> The landscape of text-to-image synthesis has evolved rapidly, with early breakthroughs driven by Generative Adversarial Networks (GANs) (Reed et al., 2016; Zhang et al., 2017). While GANs demonstrated the feasibility of generating images from text, they often struggled with mode collapse and generating high-fidelity, diverse outputs. Subsequently, autoregressive transformer-based models (Ramesh et al., 2021; Ding et al., 2021) emerged, offering improved coherence and detail by modeling images as sequences of tokens. However, their sequential generation process typically led to slow inference times.', '18a20', '> A significant paradigm shift occurred with the rise of diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Dhariwal & Nichol, 2021). These models, by iteratively denoising a random signal conditioned on text, have achieved unparalleled photorealism and diversity, often surpassing GANs. To mitigate the high computational cost of operating in pixel space, Latent Diffusion Models (LDMs) (Rombach et al., 2022) compress images into a lower-dimensional latent space, where the diffusion process is then applied. This approach significantly reduces computational complexity while preserving perceptual quality.', '19a22,23', '> Further advancements have explored hierarchical or cascaded diffusion strategies (Ramesh et al., 2022; Saharia et al., 2022), where multiple models operate at different resolutions or levels of abstraction to progressively refine image details. 
Our work builds upon these foundations by introducing a novel three-stage architecture that pushes the boundaries of computational efficiency through an exceptionally compact semantic image representation, distinctly different from prior multi-stage or latent diffusion approaches.', '> ', '32,33c36,37', '< Our method comprises three stages, all implemented as deep neural networks. For image generation, we first generate a latent image at a strong compression ratio using a text-conditional LDM (Stage C). Subsequently, this representation is transformed to a less-compressed latent space by the means of a secondary model which is tasked for this reconstruction (Stage B). Finally, the tokens that comprise the latent image in this intermediate resolution are decoded to yield the output image (Stage A).', '< The training of this architecture is performed in reverse order, starting with Stage A, then following up with Stage B and finally Stage C (see Figure 3). Text conditioning is applied on Stage C using CLIP-H (Ilharco et al., 2021). Details on the training procedure can be found in Appendix F. ', '---', '> Our method introduces a sophisticated three-stage architecture, each stage implemented as a distinct deep neural network. The image generation process begins with Stage C, a text-conditional Latent Diffusion Model (LDM) that generates a highly compressed latent image. This representation is then transformed into a less-compressed latent space by Stage B, a secondary model specifically designed for this reconstruction task. Finally, Stage A, a VQGAN decoder, decodes these intermediate latent tokens to produce the full-resolution output image.', '> The training of this architecture proceeds in reverse order of inference (see Figure 3): we initiate training with Stage A, followed by Stage B, and conclude with Stage C. Text conditioning is applied to Stage C using CLIP-H embeddings (Ilharco et al., 2021). 
Comprehensive details regarding the training procedure are provided in Appendix F. ', '75c79', '< While we achieve a higher Inception Score (IS) on COCO30K compared to all other models in our broader comparison in Table 2 also shows a relatively high FID on the same dataset. While still outperforming larger models like CogView2 (Ding et al., 2022) and our Baseline LDM, the FID is substantially lower compared to other state-of-the-art models. We attribute this discrepancy to  high-frequency features in the images. During visual inspections we find that images generates by Würstchen tend smoother than in other text-to-image models. This difference is most noticeable in real-world images like COCO, on which we compute the FID-metric.', '---', "> While Würstchen achieves a higher Inception Score (IS) on COCO30K than all other models in our broader comparison (Table 2), it exhibits a relatively higher FID on the same dataset compared to some state-of-the-art models. We posit that this discrepancy arises from Würstchen's tendency to generate images with a smoother aesthetic, particularly noticeable in real-world images like those in COCO. This characteristic, while potentially contributing to a higher (worse) FID because high-frequency detail is reduced, is often perceived favorably in human evaluations, aligning with our PickScore results and user preference studies. This suggests that FID, while a standard metric, may not fully capture the nuanced visual quality preferred by humans, especially when models exhibit distinct stylistic properties.", '85c89', '< In this work, we presented our text-conditional image generation model Würstchen, which employs a three stage process of decoupling text-conditional image generation from high-resolution spaces. The proposed process enables to train large-scale models efficiently, substantially reducing computational requirements, while at the same time providing high-fidelity images. 
Our trained model achieved comparable performance to models trained using significantly more computational resources, illustrating the viability of this approach and suggesting potential efficient scalability to even larger model parameters. We hope our work can serve as a starting point for further research into a more sustainable and computationally more efficient domain of generative AI and open up more possibilities into training, finetuning & deploying large-scale models on consumer hardware. We will provide all of our source code, including training-, and inference scripts and trained models on GitHub.', '---', '> In this work, we introduced Würstchen, a novel text-conditional image generation model that pioneers a three-stage process to effectively decouple text-conditional image generation from high-resolution pixel spaces. This innovative architecture enables the efficient training of large-scale models, substantially reducing computational requirements while consistently delivering high-fidelity images. Our trained model achieved performance comparable to models trained with significantly greater computational resources, illustrating the viability of this approach and suggesting potential for efficient scaling to larger parameter counts. We firmly believe that Würstchen represents a pivotal advancement, serving as a robust foundation for future research towards a more sustainable and computationally efficient paradigm within generative AI. This work opens up unprecedented possibilities for training, fine-tuning, and deploying large-scale generative models on widely accessible consumer hardware, thereby democratizing access to cutting-edge image synthesis capabilities. We are committed to fostering open science by providing all of our source code, including comprehensive training and inference scripts, alongside the full suite of trained model weights on GitHub.', '211d214', '< ']
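The compute figures and stage ordering quoted on both sides of the diff can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are assumptions made for this example and are not taken from the Würstchen code base, and the numbers are copied from the text above rather than measured.

```python
# Figures as quoted in the abstract and introduction of the diffed paper.
WUERSTCHEN_GPU_HOURS = 24_602   # reported A100 GPU hours for Würstchen
SD21_GPU_HOURS = 200_000        # reported GPU hours for Stable Diffusion 2.1

# How many times less compute Würstchen is reported to need.
reduction_factor = SD21_GPU_HOURS / WUERSTCHEN_GPU_HOURS

# The same comparison expressed as a fraction of the SD 2.1 budget.
fraction_of_baseline = WUERSTCHEN_GPU_HOURS / SD21_GPU_HOURS

# Stage ordering as described in the text: inference runs C -> B -> A,
# and training proceeds in the reverse order.
INFERENCE_ORDER = ["C", "B", "A"]
TRAINING_ORDER = list(reversed(INFERENCE_ORDER))

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"fraction of SD 2.1 compute: {fraction_of_baseline:.1%}")
print("inference order:", " -> ".join(INFERENCE_ORDER))
print("training order: ", " -> ".join(TRAINING_ORDER))
```

The ratio comes out slightly above 8, which is consistent with the "8x reduction" wording used in both the original and the revised text.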
