TL;DR: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders
Abstract: Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concept and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that a single SAE can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at [GitHub](https://github.com/cywinski/SAeUron).
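The sketch below illustrates the core idea of intervening on diffusion-model activations through an SAE: encode the activations into sparse feature codes, zero out the features selected for the unwanted concept, and decode back while keeping the SAE's reconstruction residual. This is a minimal illustration, not the authors' implementation; the SAE architecture, tensor shapes, the zeroing rule, and `concept_feature_ids` are assumptions for the example.

```python
# Minimal sketch (not the paper's exact code): blocking concept-linked SAE
# features in a diffusion model's intermediate activations.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Toy SAE: activations -> sparse non-negative feature codes -> reconstruction."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.encoder(x))  # sparse feature activations

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return self.decoder(f)


def ablate_concept(sae: SparseAutoencoder,
                   activations: torch.Tensor,
                   concept_feature_ids: torch.Tensor) -> torch.Tensor:
    """Zero out the selected concept features and map the edited codes back
    to activation space, preserving the part the SAE does not reconstruct."""
    features = sae.encode(activations)
    residual = activations - sae.decode(features)  # unexplained component
    features[..., concept_feature_ids] = 0.0       # block concept-specific features
    return sae.decode(features) + residual


if __name__ == "__main__":
    sae = SparseAutoencoder(d_model=1280, d_features=16384)
    acts = torch.randn(4, 1280)  # stand-in for one block's activations at a timestep
    edited = ablate_concept(sae, acts, torch.tensor([12, 345, 678]))
    print(edited.shape)  # torch.Size([4, 1280])
```

In practice the edited activations would replace the originals at the hooked layer during selected denoising timesteps, leaving the rest of the forward pass untouched.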
Lay Summary: Powerful image generation models can produce unwanted content, like inappropriate images, raising significant safety and ethical concerns. While we can try to prevent this by adjusting the model's parameters, such an approach often makes it difficult to understand the specific changes being introduced within the model itself. **Could we instead directly edit a model's internal workings in a transparent way?**
To achieve this, we train a neural network to decompose a model's internal representations into a set of human-interpretable features. This allows us to pinpoint and block only the features responsible for an unwanted concept: for instance, to prevent the model from generating 'cats,' we can identify and block its internal 'cat paw' and 'whisker' features, as sketched below. Because we only target these relevant features, the model's overall performance remains largely unaffected.
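One simple way to pick such concept-linked features, shown below purely for illustration, is to score each SAE feature by how much more strongly it fires on activations from concept-containing prompts than on other prompts and keep the top-k. The scoring rule and the value of `k` are assumptions for this sketch, not necessarily the paper's exact selection criterion.

```python
# Illustrative feature selection: rank SAE features by their average activation
# gap between concept and non-concept samples, then keep the top-k indices.
import torch


def select_concept_features(concept_codes: torch.Tensor,
                            other_codes: torch.Tensor,
                            k: int = 16) -> torch.Tensor:
    """concept_codes / other_codes: SAE feature codes of shape (num_samples, d_features)."""
    score = concept_codes.mean(dim=0) - other_codes.mean(dim=0)
    return torch.topk(score, k=k).indices  # most concept-specific features
```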
This approach not only gives us more control over what image generation models produce but also proves more effective at removing unwanted concepts than existing methods. Crucially, we can see and understand exactly which internal features of the model are being modified. Our work demonstrates how techniques that help us understand these complex models can be directly applied to solve significant real-world challenges.
Link To Code: https://github.com/cywinski/SAeUron
Primary Area: Social Aspects->Accountability, Transparency, and Interpretability
Keywords: sparse autoencoder, diffusion model, unlearning, interpretability
Submission Number: 4020