A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders

Published: 10 Oct 2024, Last Modified: 03 Dec 2024 · IAI Workshop @ NeurIPS 2024 · CC BY 4.0
Keywords: mechanistic-interpretability, NLP, interpretability, sparse-autoencoders
TL;DR: We investigate SAE feature interpretability using spelling as a case study, and find a novel form of feature splitting called "feature absorption".
Abstract: Sparse Autoencoders (SAEs) have emerged as a promising approach to decompose the activations of Large Language Models (LLMs) into human-interpretable components. But to what extent do SAEs extract monosemantic and interpretable latents? We systematically evaluate the precision and recall of a large number of SAEs with varying width and sparsity on a first-letter identification task, where we have complete access to ground-truth labels for all tokens in the vocabulary. Critically, we identify a problematic form of feature splitting we call "feature absorption", where seemingly monosemantic latents fail to fire in cases where they apparently should. Our investigation suggests that varying SAE size or sparsity is insufficient to solve this issue, and that this is a more fundamental problem related to promoting sparsity in the presence of co-occurring features.
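The evaluation described in the abstract treats each SAE latent as a binary classifier for a first-letter property. A minimal sketch of that style of scoring is below, assuming boolean arrays over the token vocabulary; the names (`latent_fires`, `is_first_letter_a`, `latent_activations`, `vocab`) are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

def precision_recall(latent_fires: np.ndarray, is_first_letter_a: np.ndarray):
    """Score one SAE latent as a classifier for 'token starts with a'.

    Both inputs are boolean arrays with one entry per vocabulary token:
    whether the latent fires on that token, and the ground-truth label.
    """
    tp = np.sum(latent_fires & is_first_letter_a)   # latent fires, label is true
    fp = np.sum(latent_fires & ~is_first_letter_a)  # latent fires, label is false
    fn = np.sum(~latent_fires & is_first_letter_a)  # latent silent, label is true
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical usage, assuming per-token latent activations are available:
# vocab = [tokenizer.decode([i]).strip().lower() for i in range(vocab_size)]
# is_a = np.array([w.startswith("a") for w in vocab])
# fires = latent_activations > 0  # one activation value per vocab token
# p, r = precision_recall(fires, is_a)
```

Under this framing, "feature absorption" shows up as a latent with high precision but degraded recall: the tokens it misses are exactly those absorbed by other, more specific latents.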
Track: Main track
Submitted Paper: No
Published Paper: No
Submission Number: 23