When Worse is Better: Navigating the Compression Generation Trade-off In Visual Tokenization

Vivek Ramanujan; Kushal Tirumala; Armen Aghajanyan; Luke Zettlemoyer; Ali Farhadi

When Worse is Better: Navigating the Compression Generation Trade-off In Visual Tokenization

Vivek Ramanujan, Kushal Tirumala, Armen Aghajanyan, Luke Zettlemoyer, Ali Farhadi

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 spotlightEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Representation Learning, Scaling Laws, Image Generation

TL;DR: We study how different changes to the auto-encoder rate-distortion trade-off affect discrete auto-regressive generation performance through scaling laws.

Abstract: Current image generation methods are based on a two-stage training approach. In stage 1, an auto-encoder is trained to compress an image into a latent space; in stage 2, a generative model is trained to learn a distribution over that latent space. This reveals a fundamental trade-off, do we compress more aggressively to make the latent distribution easier for the stage 2 model to learn even if it makes reconstruction worse? We study this problem in the context of discrete, auto-regressive image generation. Through the lens of scaling laws, we show that smaller stage 2 models can benefit from more compressed stage 1 latents even if reconstruction performance worsens, demonstrating that generation modeling capacity plays a role in this trade-off. Diving deeper, we rigorously study the connection between compute scaling and the stage 1 rate-distortion trade-off. Next, we introduce Causally Regularized Tokenization (CRT), which uses knowledge of the stage 2 generation modeling procedure to embed useful inductive biases in stage 1 latents. This regularization improves stage 2 generation performance better by making the tokens easier to model without affecting the stage 1 compression rate and marginally affecting distortion: we are able to improve compute efficiency 2-3$\times$ over baseline. Finally, we use CRT with further optimizations to the visual tokenizer setup to result in a generative pipeline that matches LlamaGen-3B generation performance (2.18 FID) with half the tokens per image (256 vs. 576) and a fourth the total model parameters (775M vs. 3.1B) while using the same architecture and inference procedure.

Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)

Submission Number: 13747

Loading