Abstract: Diffusion models have attained impressive visual quality for image synthesis. However, how to probe and manipulate the latent space of diffusion models has not been extensively explored. Prior work diffusion autoencoders encode the semantic representations with a single latent code, neglecting the low-level details and leading to entangled representations. To mitigate those limitations, we propose Hierarchical Diffusion Autoencoders (HDAE) that exploits the coarse-to-fine feature hierarchy for the latent space of diffusion models. Our HDAE converges 2+ times faster and encodes richer and more comprehensive coarse-to-fine representations of images. The hierarchical latent space inherently disentangles different semantic levels of features. Furthermore, we propose a truncated feature based approach for disentangled image manipulation. We demonstrate the effectiveness of our proposed HDAE with extensive experiments and applications on image reconstruction, style mixing, controllable interpolation, image editing, and multi-modal semantic image synthesis. The code will be released upon acceptance.
Loading