Keywords: latent diffusion, variational autoencoder, vision foundation models
Abstract: While latent diffusion models (LDMs) have demonstrated remarkable success in visual generation, the visual tokenizer has proven crucial for effective LDM training. Recent advances have explored incorporating Vision Foundation Model (VFM) representations into visual tokenizers through distillation, yet our experiments suggest that these methods suffer from representation degradation. In this paper, we take a more straightforward approach and directly leverage frozen VFM encoders within the VAE architecture, yielding the Vision Foundation Model Variational Autoencoder (VFM-VAE). To address the tension between semantic richness and reconstruction fidelity, we introduce Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks in the VFM-VAE decoder, enabling high-quality image reconstruction from semantically rich but spatially coarse VFM representations. Furthermore, we present a comprehensive analysis of representation dynamics during diffusion training, introducing the SE-CKNNA metric and examining the representation relationship between the visual tokenizer and the LDM. Our tokenizer design and analysis translate into superior generative performance: our diffusion model reaches a generation FID of 2.2 without CFG at merely 80 epochs, a 10× speedup over prior visual tokenizers. With additional alignment within the LDM, VFM-VAE further attains an FID of 1.62 at 640 epochs, establishing direct VFM integration as a superior paradigm for LDMs.
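To make the core idea concrete, the sketch below illustrates the kind of architecture the abstract describes: a frozen VFM encoder producing semantically rich but spatially coarse tokens, and a trainable decoder that progressively upsamples them back to pixels. This is a minimal illustration, not the authors' code; the choice of a DINOv2 ViT loaded via timm, the latent projection, and the decoder layout are assumptions for demonstration only.

```python
# Minimal sketch (assumed details, not the paper's implementation) of a VAE built
# around a frozen vision-foundation-model encoder, as described in the abstract.
import torch
import torch.nn as nn
import timm


class VFMVAESketch(nn.Module):
    def __init__(self, latent_dim: int = 16, image_size: int = 224):
        super().__init__()
        # Frozen VFM encoder (assumed: DINOv2 ViT-B/14 from timm).
        self.encoder = timm.create_model(
            "vit_base_patch14_dinov2", pretrained=True, num_classes=0, img_size=image_size
        )
        for p in self.encoder.parameters():
            p.requires_grad = False

        feat_dim = self.encoder.embed_dim
        # Project patch tokens to a compact latent (mean and log-variance, VAE-style).
        self.to_latent = nn.Linear(feat_dim, 2 * latent_dim)

        # Trainable decoder that upsamples the coarse 16x16 latent grid back to
        # 224x224 pixels in stages (a stand-in for the paper's Multi-Scale Latent
        # Fusion / Progressive Resolution Reconstruction blocks).
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_dim, 256, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 128, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=7, mode="nearest"),
            nn.Conv2d(128, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor):
        with torch.no_grad():  # the VFM stays frozen throughout training
            tokens = self.encoder.forward_features(x)
        patch_tokens = tokens[:, self.encoder.num_prefix_tokens:, :]  # drop CLS token
        mu, logvar = self.to_latent(patch_tokens).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

        b, n, c = z.shape
        h = w = int(n ** 0.5)  # reshape the token sequence to a spatial grid
        z = z.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(z), mu, logvar
```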
Supplementary Material: zip
Primary Area: generative models
Submission Number: 17068