Keywords: Mechanistic Interpretability, Polysemanticity, Sparse Autoencoders, Amortized Inference
TL;DR: The global optimality of amortized encoding conflicts with the instance-level optimality of monosemantic features. We advocate reducing investment in purely amortization-based methods.
Abstract: Polysemanticity has long been a major challenge in Mechanistic Interpretability (MI), and Sparse Autoencoders (SAEs) have emerged as a promising solution. SAEs employ a shared encoder to map inputs to sparse codes, thereby amortizing inference costs across all instances. However, this parameter-sharing paradigm inherently conflicts with the MI community's emphasis on instance-level optimality, including the consistency and stitchability of monosemantic features. This paper therefore advocates reduced investment in amortization-based encoding methods for polysemanticity disentanglement. From the perspective of training dynamics, we first reveal the trade-offs among various pathological phenomena, including feature absorption, feature splitting, dead latents, and dense latents, under global reconstruction-sparsity constraints. We find that increased sparsity typically exacerbates multiple pathologies at once, and we attribute this trade-off to amortized inference. As a first step in this new direction, we also explore semi-amortized and non-amortized encoding methods and find that they significantly mitigate many limitations of SAEs. This work provides insights for understanding SAEs and suggests a paradigm shift for future research on polysemanticity disentanglement. The code is available at \url{https://anonymous.4open.science/r/sae-amortization-5335}.
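To make the amortized vs. non-amortized distinction concrete, here is a minimal PyTorch sketch, not the paper's actual method: the dimensions (`d_model`, `d_latent`), the `l1_coeff` penalty, and the Adam-based refinement loop are all illustrative assumptions. The amortized path produces every code with one shared encoder pass; the per-instance path optimizes each code directly against the same decoder dictionary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_latent = 64, 256  # illustrative sizes, not from the paper
W_dec = torch.randn(d_latent, d_model) / d_model**0.5  # shared decoder dictionary
W_enc = torch.randn(d_model, d_latent) / d_model**0.5  # amortized encoder weights
b_enc = torch.zeros(d_latent)

x = torch.randn(8, d_model)  # a batch of activations to disentangle

# Amortized inference: one shared encoder pass yields codes for all instances,
# so encoding cost is amortized but each code is only globally, not
# instance-wise, optimal.
z_amortized = F.relu(x @ W_enc + b_enc)

# Non-amortized inference: optimize each instance's code directly against the
# same decoder (basic sparse coding with an L1 penalty), trading compute for
# instance-level optimality.
z = z_amortized.clone().requires_grad_(True)  # warm start from the encoder
opt = torch.optim.Adam([z], lr=1e-2)
l1_coeff = 1e-3  # assumed sparsity weight
for _ in range(200):
    opt.zero_grad()
    recon = F.relu(z) @ W_dec
    loss = F.mse_loss(recon, x) + l1_coeff * F.relu(z).sum(-1).mean()
    loss.backward()
    opt.step()
z_instance = F.relu(z.detach())
```

In this reading, the warm start from `z_amortized` followed by per-instance refinement corresponds to a semi-amortized scheme, while initializing `z` from scratch for each input would be fully non-amortized.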
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 8846