Abstract: Normalizing generative flows once emerged as a promising class of generative models for natural images. This class of models offers several advantages: the ability to efficiently compute the log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of their samples was not competitive with that of other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve a much simpler architecture that matches the performance of existing normalizing flow models and surpasses them when paired with pretraining. While the overall visual quality is still behind the current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building blocks of more powerful generative models.
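To illustrate the coupling-based design the abstract refers to, here is a minimal sketch of a single affine coupling block whose conditioner is a Transformer layer instead of a CNN. This is not the paper's actual architecture; the class name `AffineCoupling`, the use of `nn.TransformerEncoderLayer`, and all hyperparameters are illustrative assumptions. It shows why coupling flows make the log-likelihood cheap to evaluate: the Jacobian log-determinant is just the sum of the predicted log-scales.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One coupling block: half the channels condition an affine map of the other half.

    Illustrative sketch only -- not the architecture from the paper.
    """
    def __init__(self, dim: int, width: int = 512, heads: int = 8):
        super().__init__()
        # Conditioner: a single Transformer encoder layer over the first half
        # of the channels, followed by a linear head predicting per-dimension
        # (log_scale, shift) for the second half.
        self.body = nn.TransformerEncoderLayer(
            d_model=dim // 2, nhead=heads, dim_feedforward=width, batch_first=True
        )
        self.head = nn.Linear(dim // 2, dim)

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim); split channels in half.
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.head(self.body(x1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)  # keep scales bounded for numerical stability
        z2 = x2 * torch.exp(log_s) + t
        # The Jacobian is triangular, so log|det J| is simply the sum of the
        # log-scales; this is what makes the exact log-likelihood efficient.
        log_det = log_s.sum(dim=(1, 2))
        return torch.cat([x1, z2], dim=-1), log_det

    def inverse(self, z: torch.Tensor):
        # Sampling inverts the same map in closed form, so generation is fast.
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.head(self.body(z1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (z2 - t) * torch.exp(-log_s)
        return torch.cat([z1, x2], dim=-1)
```

A full flow would stack many such blocks, alternating which half of the channels is transformed, and accumulate the per-block `log_det` terms into the total log-likelihood.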
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Make SOTA claims more precise by emphasising that we rely on pretraining to beat SOTA and only match SOTA in the standard data-limited settings.
- Improve Fig. 3 by adding NLL results for the train split.
- Add a comparison to the DenseFlow model in the Appendix.
- Typo fixes.
Assigned Action Editor: Ole Winther
Submission Number: 3927