Abstract: Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the tokenizer's latent space that enable better learning and generation in diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to latent distributions with better structure, such as those with fewer Gaussian-mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) that leverages mask modeling to learn a semantically rich latent space while maintaining reconstruction fidelity.
Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary and that the discriminative latent space of a plain AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, reaching a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.
Lay Summary: Diffusion models are a leading approach in AI for generating realistic images. To make them efficient, these models often rely on compressing images into a simplified form—called a latent space—before generation begins. This compression is handled by a component known as a tokenizer. But what makes a tokenizer good for generating high-quality images?
This paper introduces MAETok, a new tokenizer based on masked autoencoders (MAE), which learn by hiding parts of the input and trying to reconstruct them. The authors show that tokenizers trained in this way produce more discriminative latent spaces—that is, they better capture meaningful differences between image features. This leads to more effective training and better results when used in diffusion models.
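The masked-reconstruction idea described above can be illustrated with a minimal NumPy sketch. This is not MAETok's implementation: the patch grid, mask ratio, zero-valued mask token, and identity "model" are all simplifying assumptions made purely to show where the reconstruction loss is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy image represented as 16 patch embeddings (4x4 grid, 8-dim each).
patches = rng.normal(size=(16, 8))

# Hide a random 75% of patches (the ratio here is illustrative only).
mask_ratio = 0.75
num_masked = int(mask_ratio * len(patches))
masked_idx = rng.choice(len(patches), size=num_masked, replace=False)

# Replace masked patches with a placeholder token (zeros stand in for a
# learnable [MASK] embedding).
corrupted = patches.copy()
corrupted[masked_idx] = 0.0

# A real model would encode `corrupted` and decode predictions; an
# identity "model" suffices to show the training objective.
predicted = corrupted

# MAE-style objective: mean squared error only on the masked positions,
# so the model is rewarded for inferring the hidden content.
loss = np.mean((predicted[masked_idx] - patches[masked_idx]) ** 2)
print(num_masked, loss > 0.0)
```

Training on this objective forces the encoder to summarize the visible patches into features informative enough to predict the hidden ones, which is the intuition behind the more discriminative latent space.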
Through both theoretical analysis and practical experiments, the authors find that having a well-structured latent space is more important than using complex designs like variational autoencoders (VAE). MAETok achieves state-of-the-art image generation on standard benchmarks while being significantly faster and more efficient, using just 128 tokens—far fewer than most previous methods.
This work provides new insights into how latent representations impact generative performance and offers a practical, scalable solution for high-quality image synthesis.
Link To Code: https://github.com/Hhhhhhao/continuous_tokenizer
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion Models, Autoencoders, Masked Autoencoders
Submission Number: 14576