Emergence and Evolution of Interpretable Concepts in Diffusion Models Through the Lens of Sparse Autoencoders
Keywords: mechanistic interpretability, diffusion models, sparse autoencoders, controlled generation
TL;DR: We investigate how human-interpretable concepts evolve in diffusion models through the generative process.
Abstract: Diffusion models (DMs) can generate diverse images with exceptional visual quality that align with the input natural language prompt. However, the inner workings of DMs, especially the evolution of their internal representations throughout the generative process, are still largely a mystery. Mechanistic interpretability techniques, such as Sparse Autoencoders (SAEs), aim to uncover the fundamental operating principles of models through granular analysis of their features, and have been successful in understanding and steering the behavior of large language models at scale. In this work, we leverage the SAE framework to probe the inner workings of text-to-image DMs and uncover a variety of human-interpretable concepts in their activations. We find that \textit{even before the first reverse diffusion step} is completed, the final composition of the scene can be predicted surprisingly well from the spatial distribution of activated concepts. We further observe that while image composition is mostly finalized by the middle of the reverse process, image style is still subject to change. Finally, we design SAE-based interventions that control the layout and style of the generated image.
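The abstract describes training sparse autoencoders on diffusion-model activations and then reading out per-location concept activations. The sketch below is a minimal, illustrative example of that general setup, not the authors' implementation; the layer choice, dimensions (`d_model`, `n_latents`), L1 coefficient, and feature-map shape are all assumptions made purely for demonstration.

```python
# Minimal sketch (assumptions throughout, not the paper's code) of an SAE
# trained on flattened spatial activations from one diffusion layer/timestep,
# followed by inspecting the spatial map of a single candidate concept.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder whose sparsity-penalized latent units
    serve as candidate human-interpretable concepts."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_latents)
        self.decoder = nn.Linear(n_latents, d_model, bias=False)
        self.pre_bias = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor):
        # x: (n_positions, d_model) activations from the diffusion backbone
        latents = F.relu(self.encoder(x - self.pre_bias))
        recon = self.decoder(latents) + self.pre_bias
        return recon, latents


def sae_loss(x, recon, latents, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty encouraging sparse concept codes.
    return F.mse_loss(recon, x) + l1_coeff * latents.abs().sum(dim=-1).mean()


if __name__ == "__main__":
    # Stand-in for activations at a fixed denoising step, flattened over a
    # batch of 4 images with 32x32 spatial positions and 1280 channels.
    acts = torch.randn(4 * 32 * 32, 1280)
    sae = SparseAutoencoder(d_model=1280, n_latents=16384)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

    recon, latents = sae(acts)
    loss = sae_loss(acts, recon, latents)
    loss.backward()
    opt.step()

    # Spatial distribution of one (hypothetical) concept unit: reshape its
    # per-position activations back into the feature-map grid.
    concept_id = 123
    concept_map = latents[:, concept_id].reshape(4, 32, 32)  # (batch, H, W)
    print(concept_map.shape, float(loss))
```

In practice, such concept maps could be compared against the final generated image to see how early the scene layout is predictable, and the decoder directions of selected units could be added to or ablated from the activations to steer layout or style, in the spirit of the interventions the abstract mentions.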
Submission Number: 39