Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Visual tokenization via auto-encoding empowers state-of-the-art image and video generative models by compressing pixels into a latent space. However, questions remain about how auto-encoder design impacts reconstruction and downstream generative performance. This work explores scaling in auto-encoders for reconstruction and generation by replacing the convolutional backbone with an enhanced Vision Transformer for Tokenization (ViTok). We find that scaling the auto-encoder bottleneck correlates with reconstruction quality but exhibits a nuanced relationship with generation. Separately, encoder scaling yields no gains, while decoder scaling improves reconstruction with minimal impact on generation. As a result, we determine that scaling the current paradigm of auto-encoders is not effective for improving generation performance. Coupled with Diffusion Transformers, ViTok achieves competitive image reconstruction and generation performance on 256p and 512p ImageNet-1K. On videos, ViTok achieves SOTA reconstruction and generation performance on 16-frame 128p UCF-101.
Lay Summary: Visual tokenization is a method that compresses raw visual data into a form that is easier to generate in. Auto-encoders typically use CNNs to perform this compression, though ViTs have been explored as well. Our research investigates how scaling these auto-encoders affects the quality of reconstruction (how accurately the original image is rebuilt) and how it affects downstream generation of images or videos. We discovered that making the compression part (the bottleneck) bigger helps reconstruct images better but doesn't consistently enhance the quality of generations. Enlarging the encoder showed no gain, whereas enlarging the decoder improved image reconstruction but again had limited effect on generation. Ultimately, our findings indicate that simply making current auto-encoder designs larger doesn't significantly enhance their ability to generate better images or videos. Still, when combined with other techniques, our ViTok approach achieves strong performance in reconstructing and generating realistic visuals, even surpassing previous best results for short video clips.
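The bottleneck-vs-reconstruction trade-off described above can be illustrated with a toy example. The sketch below is not the paper's ViTok architecture; it uses a simple linear auto-encoder (PCA via SVD) on synthetic "patch" data to show, under these toy assumptions, how enlarging the bottleneck dimension monotonically improves reconstruction error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "patches": a low-rank signal plus noise, standing in for
# flattened image patches fed to a tokenizer (toy data, not real images).
signal = rng.normal(size=(1000, 16)) @ rng.normal(size=(16, 48))
noise = rng.normal(scale=0.1, size=(1000, 48))
X = signal + noise
Xc = X - X.mean(axis=0)  # center before PCA

# Optimal linear auto-encoder: top-d principal directions.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

def recon_error(d: int) -> float:
    """Encode into a d-dim bottleneck, decode, and return mean squared error."""
    Z = Xc @ Vt[:d].T   # encode: project onto top-d components
    Xh = Z @ Vt[:d]     # decode: map bottleneck codes back to patch space
    return float(np.mean((Xc - Xh) ** 2))

errors = {d: recon_error(d) for d in (4, 8, 16, 32)}
for d, e in errors.items():
    print(f"bottleneck={d:2d}  recon MSE={e:.4f}")
```

Reconstruction error drops as the bottleneck widens, mirroring the paper's observation that bottleneck scale correlates with reconstruction quality; the paper's further point is that this improvement does not straightforwardly transfer to generation quality.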
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/Na-VAE/navae
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Auto-Encoder, Visual Compression, Latent Diffusion Models, Vision Transformer
Submission Number: 7697