Improving the Diffusability of Autoencoders

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: We explore the spectral properties of modern autoencoders used for image/video latent diffusion training and find that a simple downsampling regularization can substantially boost their downstream LDM performance.
Abstract: Latent diffusion models have emerged as the leading approach for generating high-quality images and videos, utilizing compressed latent representations to reduce the computational burden of the diffusion process. While recent advancements have primarily focused on scaling diffusion backbones and improving autoencoder reconstruction quality, the interaction between these components has received comparatively less attention. In this work, we perform a spectral analysis of modern autoencoders and identify inordinate high-frequency components in their latent spaces, which are especially pronounced in autoencoders with a large bottleneck channel size. We hypothesize that these high-frequency components interfere with the coarse-to-fine nature of the diffusion synthesis process and hinder generation quality. To mitigate the issue, we propose scale equivariance: a simple regularization strategy that aligns latent and RGB spaces across frequencies by enforcing scale equivariance in the decoder. It requires minimal code changes and only up to $20$K autoencoder fine-tuning steps, yet significantly improves generation quality, reducing FID by 19% for image generation on ImageNet-1K 256×256 and FVD by at least 44% for video generation on Kinetics-700 17×256×256. The source code is available at https://github.com/snap-research/diffusability.
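As a rough illustration of the proposed regularization, the sketch below shows one way to enforce scale equivariance during autoencoder fine-tuning: downsample the latent, decode it, and compare the result to a correspondingly downsampled image. This is a minimal sketch assuming a PyTorch autoencoder exposing `encode`/`decode`; the function name, the bilinear resizing, and the L1 loss are illustrative choices, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def scale_equivariance_loss(autoencoder, x, scale=0.5):
    """Hypothetical regularizer: decoding a downsampled latent should match a downsampled image."""
    z = autoencoder.encode(x)                      # latent tensor, e.g. (B, C, h, w)
    z_small = F.interpolate(z, scale_factor=scale, mode="bilinear", align_corners=False)
    x_hat_small = autoencoder.decode(z_small)      # reconstruction from the downsampled latent
    x_small = F.interpolate(x, scale_factor=scale, mode="bilinear", align_corners=False)
    return F.l1_loss(x_hat_small, x_small)

# During fine-tuning, this term would be added to the usual reconstruction objective, e.g.:
# loss = recon_loss + lambda_se * scale_equivariance_loss(autoencoder, x)
```

See the repository above for the authors' actual implementation and training settings.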
Lay Summary: In recent years, image and video generation models have rapidly advanced, with both industry and academia investing heavily. Most of these models follow the latent diffusion approach: an autoencoder first compresses images or videos into a smaller latent space, and then a diffusion model is trained to generate samples in that space. So far, most work has focused on improving the autoencoder's reconstruction quality and compression rate. But our work shows that the choice of autoencoder has a deeper effect: it shapes how well a diffusion model can generate realistic outputs. We call this diffusability: how easy it is for a diffusion model to learn to generate in a given representation space. Diffusion models build images by gradually refining noise, starting from a blurry outline and adding details step by step. This process tends to struggle with high-frequency details (like textures or fine edges), where errors can accumulate. Normally, the human eye is less sensitive to these errors in pixel space. But we found that some autoencoders place more emphasis on high frequencies in their latent space than RGB images do. As a result, critical image structures get encoded in unstable high-frequency components, making them harder for the diffusion model to learn and sample correctly. To address this, we introduce a simple training technique: during autoencoder training, we downsample the latent representation and require the decoder to still produce a meaningful reconstruction. This encourages the autoencoder to store important information in more robust, low-frequency components. We show that this small change leads to large improvements. It makes latent spaces more suitable for diffusion models, improving both image and video generation quality on benchmarks like ImageNet and Kinetics.
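To make the spectral observation above concrete, the sketch below estimates how much of a tensor's energy sits in high DCT frequencies; comparing this fraction for RGB images and for autoencoder latents is one way to surface the excess high-frequency content the summary describes. This is a minimal sketch: the `high_frequency_energy_fraction` helper, the 0.5 cutoff, and the channel-wise handling are assumptions for illustration, not the paper's measurement protocol.

```python
import numpy as np
from scipy.fft import dctn

def high_frequency_energy_fraction(x, cutoff=0.5):
    """x: array of shape (C, H, W); fraction of DCT energy above `cutoff` of the max frequency."""
    spec = dctn(np.asarray(x, dtype=np.float64), axes=(-2, -1), norm="ortho")
    power = spec ** 2
    h, w = power.shape[-2:]
    fy = np.arange(h)[:, None] / h                 # normalized vertical frequency in [0, 1)
    fx = np.arange(w)[None, :] / w                 # normalized horizontal frequency in [0, 1)
    high = np.maximum(fy, fx) > cutoff             # boolean mask of high-frequency coefficients
    return power[..., high].sum() / power.sum()

# Example: a latent with a larger fraction than its RGB counterpart puts comparatively
# more energy into high frequencies, which the paper links to reduced diffusability.
```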
Link To Code: https://github.com/snap-research/diffusability
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: autoencoders, latent diffusion, image generation, video generation, DCT, diffusability, fourier transform
Submission Number: 5299