Keywords: image tokenizer, latent diffusion model, image synthesis
TL;DR: A semantic-disentangled VAE for effective and efficient training of latent diffusion models
Abstract: Latent Diffusion Models (LDMs) rely on image tokenizers, typically implemented as Variational Autoencoders (VAEs), to compress high-dimensional images into a compact latent space, enabling efficient generative modeling. We contend that VAEs trained solely on a pixel-level reconstruction objective struggle to capture rich semantic information, which complicates the modeling task for downstream diffusion models. In this paper, we propose that a generation-friendly VAE should be capable of semantic disentanglement, i.e., it should encode attribute-level semantic information more effectively. To this end, we introduce the Semantic-disentangled VAE (Send-VAE), which leverages the rich semantic knowledge of pre-trained vision foundation models to improve the VAE's ability to disentangle semantics. Specifically, we employ a non-linear mapper network to transform the VAE's latent representations and then align them with the representations from vision foundation models. The mapper network is designed to bridge the representation gap between the VAE and the vision foundation models, thereby providing effective guidance for VAE learning. Additionally, we apply linear probing on attribute prediction tasks to assess the VAE's semantic disentanglement ability, demonstrating a strong correlation with downstream generation performance. Finally, using the proposed Send-VAE, we train the popular flow-based transformers (SiTs); experimental results indicate that Send-VAE significantly speeds up SiT training and achieves new state-of-the-art FID scores of 1.21 and 1.75 with and without classifier-free guidance, respectively, on ImageNet at 256 × 256 resolution.
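The abstract does not specify the exact form of the alignment objective; the following is a minimal, hypothetical sketch (not the authors' code) of one plausible instantiation: a non-linear mapper projecting VAE latents into a frozen vision-foundation-model feature space, trained with a cosine-similarity alignment loss. All module names, dimensions, and the choice of loss are assumptions for illustration.

```python
# Hypothetical sketch: aligning mapped VAE latents with frozen
# vision-foundation-model features via a cosine-similarity objective.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MapperNetwork(nn.Module):
    """Non-linear mapper projecting VAE latents into the foundation model's feature space."""

    def __init__(self, latent_dim: int, feature_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, feature_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)


def alignment_loss(mapped: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Negative cosine similarity between mapped latents and frozen foundation-model features."""
    mapped = F.normalize(mapped, dim=-1)
    target = F.normalize(target, dim=-1)
    return 1.0 - (mapped * target).sum(dim=-1).mean()


if __name__ == "__main__":
    # Shapes are illustrative: z from a VAE encoder (per-token latents),
    # feats from a frozen vision foundation model (e.g., a DINO-style encoder).
    mapper = MapperNetwork(latent_dim=32, feature_dim=768)
    z = torch.randn(4, 256, 32)        # (batch, tokens, latent_dim)
    feats = torch.randn(4, 256, 768)   # (batch, tokens, feature_dim)
    loss = alignment_loss(mapper(z), feats)
    print(loss.item())
```

In a full training setup, this alignment term would presumably be added to the standard VAE reconstruction and KL objectives, with the foundation-model encoder kept frozen.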
Primary Area: generative models
Submission Number: 3322