Keywords: Diffusion models, mechanistic interpretability, disentanglement, representation learning.
TL;DR: We introduce a method which can find interpretable features in diffusion models using Sparse Autoencoders.
Abstract: In this work, we introduce a computationally efficient method that allows Sparse Autoencoders (SAEs) to automatically detect interpretable directions within the latent space of diffusion models. We show that intervening on a single neuron in SAE representation space at a single diffusion time step leads to meaningful feature changes in the model output. This marks a step toward applying techniques from mechanistic interpretability to controlling the outputs of diffusion models and, in turn, to improving the safety of their generations. In doing so, we establish a connection between safety and interpretability methods in language modelling and image generative modelling.
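To make the described intervention concrete, below is a minimal sketch (not the authors' code) of intervening on a single SAE latent at a single diffusion time step. The SAE architecture, the hooked U-Net block (unet_block), the step_tracker dictionary, and the rescaling-style edit are all assumptions for illustration; the paper's actual method may differ.

# Hypothetical sketch: rescale one SAE latent at one denoising step.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Simple SAE with a ReLU bottleneck (assumed architecture)."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_hidden)
        self.dec = nn.Linear(d_hidden, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, z):
        return self.dec(z)

def make_intervention_hook(sae, neuron_idx, scale, step_tracker, target_step):
    """Return a forward hook that replaces a block's activations with an SAE
    reconstruction whose neuron_idx-th latent is rescaled, only at target_step."""
    def hook(module, inputs, output):
        if step_tracker["t"] != target_step:
            return output  # leave every other diffusion step untouched
        acts = output  # assumed shape: (batch, ..., d_model)
        z = sae.encode(acts)
        z[..., neuron_idx] = scale * z[..., neuron_idx]  # single-neuron edit
        return sae.decode(z)
    return hook

# Usage sketch: attach the hook to a hooked block, then run the usual sampling
# loop while updating step_tracker["t"] at each denoising step.
# step_tracker = {"t": 0}
# handle = unet_block.register_forward_hook(
#     make_intervention_hook(sae, neuron_idx=123, scale=5.0,
#                            step_tracker=step_tracker, target_step=25))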
Submission Number: 240