Steering Diffusion Transformers with Sparse Autoencoders

ICLR 2026 Conference Submission 22251 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Diffusion Transformers, Sparse Autoencoders, Mechanistic Interpretability, Steering, Flux, Stable Diffusion 3
Abstract: While diffusion models are increasingly deployed in creative and scientific domains, their interpretability has received far less attention than that of large language models. Sparse autoencoders (SAEs) have recently emerged as a powerful tool for mechanistic interpretability, yet prior work has largely focused on U-Net architectures, leaving diffusion transformers (DiTs) underexplored. In this paper, we conduct the first systematic study of using SAEs to steer DiTs, showing how a carefully designed steering algorithm makes more effective use of extracted features to improve controlled editing and reveals key aspects of the model’s internal dynamics. We make three core contributions: (1) we introduce a multi-layer steering method that increases the causal effectiveness of feature injections while mitigating undesired artifacts, (2) we propose a simple similarity-based criterion for detecting feature presence in the residual stream, which guides the choice of layers for steering, and (3) we train SAEs on transformer block updates and residual stream states and apply our best steering method to identify which components of the DiT exert the strongest causal effects on targeted concept edits. Together, these findings provide new insights into the behavior of DiTs and offer practical guidance for practitioners seeking to steer generations effectively.
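To make the two core ideas in the abstract concrete, here is a minimal sketch of (a) the similarity-based criterion for detecting whether an SAE feature is present in the residual stream at a given layer, and (b) multi-layer steering via injection of the feature's decoder direction. All names (`feature_presence`, `steer`, `residual_states`, `decoder_direction`, `alpha`) are illustrative assumptions, not the authors' actual API; this sketch assumes cosine similarity as the presence measure and additive injection with a shared strength.

```python
import torch

def feature_presence(residual_states, decoder_direction):
    """Max cosine similarity between an SAE decoder direction and the
    residual stream at each layer; high similarity suggests the feature
    is already represented there (a hypothetical presence criterion)."""
    d = decoder_direction / decoder_direction.norm()
    sims = []
    for h in residual_states:  # one [tokens, dim] tensor per layer
        cos = (h @ d) / h.norm(dim=-1).clamp_min(1e-8)
        sims.append(cos.max().item())
    return torch.tensor(sims)

def steer(residual_states, decoder_direction, layers, alpha=4.0):
    """Multi-layer steering sketch: add the (unit-norm) feature direction
    to the residual stream at each chosen layer with strength alpha."""
    d = decoder_direction / decoder_direction.norm()
    for i in layers:
        residual_states[i] = residual_states[i] + alpha * d
    return residual_states
```

In this sketch, layers whose presence score is low are natural candidates for injection, while spreading the edit across several layers (rather than one large injection at a single layer) is what the abstract credits with improving causal effectiveness and reducing artifacts.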
Primary Area: interpretability and explainable AI
Submission Number: 22251