Does Higher Interpretability Imply Better Utility? A Pairwise Analysis on Sparse Autoencoders

Published: 26 Jan 2026, Last Modified: 11 Feb 2026. ICLR 2026 Poster. License: CC BY 4.0
Keywords: Sparse Autoencoders; Interpretability; Utility
Abstract: Sparse Autoencoders (SAEs) are widely used to steer large language models (LLMs), based on the assumption that their interpretable features naturally enable effective steering of model behavior. Yet a fundamental question remains: does higher interpretability imply better steering utility? To answer this, we train 90 SAEs across three LLMs (Gemma-2-2B, Qwen-2.5-3B, Gemma-2-9B), spanning five architectures and six sparsity levels. We evaluate interpretability with SAEBench (Karvonen et al., 2025) and steering utility with AxBench (Wu et al., 2025), and analyze rank agreement via Kendall’s rank correlation coefficient $\tau_b$. Our analysis reveals only a relatively weak positive association ($\tau_b \approx 0.298$), indicating that interpretability is an insufficient proxy for steering performance. We conjecture that the interpretability–utility gap stems from feature selection: not all SAE features are equally effective for steering. To identify features that truly steer LLM behavior, we propose a novel selection criterion, $\Delta$ Token Confidence, which measures how much amplifying a feature changes the next-token distribution. Our method improves steering performance on three LLMs by **52.52%** compared to the best prior output score–based criterion (Arad et al., 2025). Strikingly, after selecting features with high $\Delta$ Token Confidence, the correlation between interpretability and utility vanishes ($\tau_b \approx 0$) and can even become negative, further highlighting their divergence for the most effective steering features.
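As a minimal sketch of the rank-agreement analysis the abstract describes, Kendall’s $\tau_b$ between per-SAE interpretability scores and steering-utility scores can be computed with `scipy.stats.kendalltau` (which uses the tie-aware tau-b variant by default). The score arrays below are hypothetical placeholders for illustration, not the paper’s data:

```python
from scipy.stats import kendalltau

# Hypothetical per-SAE scores (placeholders, not the paper's data):
# one interpretability score (e.g., from SAEBench) and one
# steering-utility score (e.g., from AxBench) per trained SAE.
interp = [0.61, 0.55, 0.72, 0.48, 0.66, 0.59]
utility = [0.34, 0.31, 0.40, 0.33, 0.35, 0.30]

# kendalltau defaults to the tau-b variant, which corrects for ties,
# matching the statistic reported in the abstract.
tau_b, p_value = kendalltau(interp, utility)
print(f"tau_b = {tau_b:.3f}")
```

A weak positive $\tau_b$ on such paired scores would mirror the paper’s finding that interpretability rankings only loosely predict steering-utility rankings.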
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 10423