Keywords: Normalizing Flows, Autoregressive Models, Latent Autoregressive Flows, Scalable Image Generation
TL;DR: We propose STARFlow, a scalable autoregressive flow model that enables high-quality image generation
Abstract: We present STARFlow, a scalable generative model based on normalizing flows that achieves strong performance on high-resolution image synthesis.
STARFlow's main building block is Transformer Autoregressive Flow (TARFlow), which combines normalizing flows with Autoregressive Transformer architectures and has recently achieved impressive results in image modeling. In this work, we first establish the theoretical universality of TARFlow for modeling continuous distributions. Building on this foundation, we introduce a set of architectural and algorithmic innovations that significantly enhance the scalability: (1) a deep-shallow design where a deep Transformer block captures most of the model’s capacity, followed by a few shallow Transformer blocks that are computationally cheap yet contribute non-negligibly, (2) learning in the latent space of pretrained autoencoders, which proves far more effective than modeling pixels directly, and (3) a novel guidance algorithm that substantially improves sample quality. Crucially, our model remains a single, end-to-end normalizing flow, allowing exact maximum likelihood training in continuous space without discretization. STARFlow achieves competitive results in both class- and text-conditional image generation, with sample quality approaching that of state-of-the-art diffusion models. To our knowledge, this is the **first** successful demonstration of normalizing flows at this scale and resolution. Code and weights available at https://github.com/apple/ml-starflow.
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 10954
Loading