Keywords: interpretability, sparse autoencoders
TL;DR: We introduce Probabilistic TopK SAEs, which use stochastic gating through the binary Concrete distribution to allow for more exploration of features during training.
Abstract: Sparse Autoencoders (SAEs) have emerged as a popular solution for extracting interpretable features from language model activations, enabling mechanistic understanding by decomposing polysemantic neurons into sparsely activated dictionary components. However, existing SAE designs suffer from deterministic activations that starve gradients to "dead" components, and produce uncalibrated coefficients that provide no meaningful notion of uncertainty. To address these limitations, we introduce Probabilistic TopK SAEs, a novel approach that augments the TopK autoencoder with probabilistic gating through the binary Concrete distribution. This stochastic sampling helps mitigate gradient starvation to dead neurons while producing coefficient magnitudes that are more correlated with the confidence of feature presence. Experiments with GPT-2 and Qwen3 show that our method achieves consistent Pareto improvements over the baselines in high-sparsity settings (small number of activated features) while maintaining a larger set of alive dictionary features. Further, we show that the coefficient magnitudes from our approach exhibit a stronger correlation between activation strength and interpretability scores, resulting in more faithful explanations for the neurons.
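To make the gating mechanism concrete, below is a minimal sketch of a TopK SAE with binary Concrete (Gumbel-Sigmoid) gating. This is an illustrative reconstruction from the abstract, not the authors' implementation; the module name, the choice to tie gate logits to the encoder pre-activations, and hyperparameters such as `temperature` are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticTopKSAE(nn.Module):
    """Sketch of a TopK SAE with stochastic binary Concrete gates.

    Hypothetical reconstruction: names, tied gate logits, and defaults
    are assumptions, not the paper's exact design.
    """

    def __init__(self, d_model: int, n_dict: int, k: int, temperature: float = 0.5):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_dict)
        self.decoder = nn.Linear(n_dict, d_model, bias=False)
        self.k = k
        self.temperature = temperature

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre_acts = self.encoder(x)   # raw feature coefficients
        logits = pre_acts            # gate logits (assumed tied to pre-activations)
        if self.training:
            # Binary Concrete sample: inject logistic noise, then apply a
            # temperature-scaled sigmoid. Stochastic gates let features that
            # would lose a deterministic TopK race still receive gradient.
            u = torch.rand_like(logits).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)
            gates = torch.sigmoid((logits + noise) / self.temperature)
        else:
            # Deterministic gates at evaluation time.
            gates = torch.sigmoid(logits)
        gated = F.relu(pre_acts) * gates
        # Hard TopK: keep only the k largest gated coefficients.
        topk = torch.topk(gated, self.k, dim=-1)
        sparse = torch.zeros_like(gated).scatter(-1, topk.indices, topk.values)
        return self.decoder(sparse)
```

Under this reading, the sigmoid gate doubles as a (relaxed) probability of feature presence, which is one way the resulting coefficient magnitudes could correlate with confidence as the abstract describes.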
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 22189