Keywords: latent diffusion, variational autoencoder, vision foundation models
Abstract: While latent diffusion models (LDMs) have demonstrated remarkable success in visual generation, the visual tokenizer has proven crucial for effective LDM training. Recent advances have explored incorporating Vision Foundation Model (VFM) representations into visual tokenizers through distillation, yet our experiments suggest that these methods suffer from representation degradation. In this paper, we take a more straightforward approach and directly leverage frozen VFM encoders within the VAE architecture, yielding the Vision Foundation Model Variational Autoencoder (VFM-VAE). To address the tension between semantic richness and reconstruction fidelity, we introduce Multi-Scale Latent Fusion and Progressive Resolution Reconstruction blocks in the VFM-VAE decoder, enabling high-quality image reconstruction from semantically rich but spatially coarse VFM representations. Furthermore, we present a comprehensive analysis of representation dynamics during diffusion training, introducing the SE-CKNNA metric and examining the representation relationship between the visual tokenizer and the LDM. Our tokenizer design and analysis translate into superior generative performance: our diffusion model reaches a generation FID of 2.2 without CFG at merely 80 epochs, a 10× speedup over prior visual tokenizers. With additional alignment within the LDM, VFM-VAE further attains an FID of 1.62 at 640 epochs, establishing direct VFM integration as a superior paradigm for LDMs.
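To make the core idea concrete, the sketch below illustrates the kind of architecture the abstract describes: a frozen VFM encoder producing semantically rich but spatially coarse tokens, and a trainable decoder that progressively upsamples them back to pixels. This is a minimal illustration, not the authors' code; the choice of a DINOv2 ViT loaded via timm, the latent projection, and the decoder layout are assumptions for demonstration only.

```python
# Minimal sketch (assumed details, not the paper's implementation) of a VAE built
# around a frozen vision-foundation-model encoder, as described in the abstract.
import torch
import torch.nn as nn
import timm


class VFMVAESketch(nn.Module):
    def __init__(self, latent_dim: int = 16, image_size: int = 224):
        super().__init__()
        # Frozen VFM encoder (assumed: DINOv2 ViT-B/14 from timm).
        self.encoder = timm.create_model(
            "vit_base_patch14_dinov2", pretrained=True, num_classes=0, img_size=image_size
        )
        for p in self.encoder.parameters():
            p.requires_grad = False

        feat_dim = self.encoder.embed_dim
        # Project patch tokens to a compact latent (mean and log-variance, VAE-style).
        self.to_latent = nn.Linear(feat_dim, 2 * latent_dim)

        # Trainable decoder that upsamples the coarse 16x16 latent grid back to
        # 224x224 pixels in stages (a stand-in for the paper's Multi-Scale Latent
        # Fusion / Progressive Resolution Reconstruction blocks).
        self.decoder = nn.Sequential(
            nn.Conv2d(latent_dim, 256, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(256, 128, 3, padding=1),
            nn.SiLU(),
            nn.Upsample(scale_factor=7, mode="nearest"),
            nn.Conv2d(128, 3, 3, padding=1),
        )

    def forward(self, x: torch.Tensor):
        with torch.no_grad():  # the VFM stays frozen throughout training
            tokens = self.encoder.forward_features(x)
        patch_tokens = tokens[:, self.encoder.num_prefix_tokens:, :]  # drop CLS token
        mu, logvar = self.to_latent(patch_tokens).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization

        b, n, c = z.shape
        h = w = int(n ** 0.5)  # reshape the token sequence to a spatial grid
        z = z.transpose(1, 2).reshape(b, c, h, w)
        return self.decoder(z), mu, logvar
```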
Supplementary Material: zip
Primary Area: generative models
Submission Number: 17068