Jet: A Modern Transformer-Based Normalizing Flow

Published: 22 Apr 2025, Last Modified: 22 Apr 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: In the past, normalizing flows emerged as a promising class of generative models for natural images. This model class has several advantages: efficient computation of the exact log-likelihood of the input data, fast generation, and a simple overall structure. Normalizing flows remained a topic of active research but later fell out of favor, as the visual quality of their samples was not competitive with other model classes, such as GANs, VQ-VAE-based approaches, or diffusion models. In this paper we revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices and using computational blocks based on the Vision Transformer architecture rather than convolutional neural networks. As a result, we achieve a much simpler architecture that matches existing normalizing flow models and improves over them when paired with pretraining. While the overall visual quality still lags behind current state-of-the-art models, we argue that strong normalizing flow models can help advance the research frontier by serving as building blocks of more powerful generative models.
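The central mechanism the abstract refers to is the coupling layer: half the dimensions pass through unchanged and parameterize an invertible affine map of the other half, so the log-determinant of the Jacobian is cheap to compute. Below is a minimal PyTorch-style sketch of one such layer, using a single Transformer encoder layer as the conditioner in place of convolutions. The class name `AffineCoupling`, the tanh-bounded scales, and all hyperparameters here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer: split features in half, predict a scale
    and shift for the second half from the first, keep the map invertible."""

    def __init__(self, dim: int, hidden: int = 512, n_heads: int = 4):
        super().__init__()
        half = dim // 2
        # Hypothetical conditioner: one Transformer encoder layer over the
        # token sequence, standing in for the paper's ViT-based blocks.
        self.body = nn.TransformerEncoderLayer(
            d_model=half, nhead=n_heads, dim_feedforward=hidden,
            batch_first=True)
        self.proj = nn.Linear(half, 2 * half)

    def forward(self, x: torch.Tensor):
        # x: (batch, tokens, dim) -> z and per-example log|det Jacobian|.
        x1, x2 = x.chunk(2, dim=-1)
        log_s, t = self.proj(self.body(x1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)          # bound scales for stability
        z2 = x2 * log_s.exp() + t
        # Jacobian is triangular, so log-det is just the sum of log-scales.
        logdet = log_s.sum(dim=(1, 2))
        return torch.cat([x1, z2], dim=-1), logdet

    def inverse(self, z: torch.Tensor) -> torch.Tensor:
        # Exact inversion: recompute scale/shift from the untouched half.
        z1, z2 = z.chunk(2, dim=-1)
        log_s, t = self.proj(self.body(z1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        x2 = (z2 - t) * (-log_s).exp()
        return torch.cat([z1, x2], dim=-1)
```

Stacking many such layers (with the split permuted between layers) and summing the returned `logdet` terms against a simple base density gives the exact log-likelihood the abstract highlights, while `inverse` gives fast sampling.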
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
  • Make SOTA claims more precise by emphasising that we rely on pretraining to beat SOTA and only match SOTA in the standard data-limited settings.
  • Improve Fig. 3 by adding NLL results for the train split.
  • Additional comparison to the Denseflow model in the Appendix.
  • Typo fixes.
Assigned Action Editor: Ole Winther
Submission Number: 3927