Towards Data-centric Interpretability with Sparse Autoencoders

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Keywords: Sparse Autoencoders, Applications of interpretability
TL;DR: We use sparse autoencoders for four data analysis tasks: data diffing, correlations, targeted clustering, and retrieval.
Abstract: We use sparse autoencoders (SAEs) trained on LLM activations for data analysis, with an emphasis on interpreting LLM data. SAEs provide an unsupervised feature space for undirected hypothesis generation, which we show is useful for (1) finding differences between datasets, and (2) finding unexpected correlations between concepts. Notably, model diffing can be done by diffing model outputs, and we find behaviors such as Grok 4 expressing more caution than other frontier models. SAE activations also act as interpretable “property tags” that represent text. We show they are a useful alternative to traditional text embeddings in (1) clustering texts to uncover novel groupings and (2) retrieving texts based on implicit properties. We position SAEs as a novel and versatile tool for data analysis, and highlight data-centric interpretability as an important direction for future work.
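To make the "property tags" idea concrete, here is a minimal sketch (not the authors' code) of how SAE feature activations could be used for data diffing between two text datasets. It assumes a hypothetical `encode_sae(texts)` helper that runs texts through an LLM and an SAE encoder and returns per-text feature activations; the specific scoring rule (log-odds of feature firing rates) is one reasonable choice, not necessarily the one used in the paper.

```python
# Sketch: SAE activations as sparse "property tags" for dataset diffing.
# `encode_sae` is a hypothetical stand-in for any LLM + SAE pipeline.

import numpy as np

def encode_sae(texts):
    """Hypothetical: return an (n_texts, n_features) array of max-pooled
    SAE feature activations for each text."""
    raise NotImplementedError  # plug in your LLM + SAE encoder here

def diff_datasets(texts_a, texts_b, top_k=10, eps=1e-6):
    # Binarize activations into per-text property tags.
    tags_a = encode_sae(texts_a) > 0
    tags_b = encode_sae(texts_b) > 0
    # Fraction of texts in which each SAE feature fires.
    freq_a = tags_a.mean(axis=0)
    freq_b = tags_b.mean(axis=0)
    # Log-odds difference highlights features over-represented in A vs. B;
    # the top features can then be interpreted by hand to characterize the diff.
    score = (np.log((freq_a + eps) / (1 - freq_a + eps))
             - np.log((freq_b + eps) / (1 - freq_b + eps)))
    return np.argsort(-score)[:top_k]
```

The same binarized tag matrix can serve as a sparse, interpretable alternative to dense text embeddings for the clustering and retrieval tasks the abstract mentions.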
Submission Number: 84