Beyond Redundancy: Diverse and Specialized Multi-Expert Sparse Autoencoder

14 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Sparse Autoencoder, LLM Interpretability, Mixture-of-Experts, Representation Learning
TL;DR: We resolve feature redundancy in the Mixture of Experts Sparse Autoencoder by co-activating multiple experts and scaling high-frequency features, which results in more diverse and specialized dictionaries.
Abstract: Sparse autoencoders (SAEs) have emerged as a powerful tool for interpreting large language models (LLMs) by decomposing token activations into combinations of human-understandable features. While SAEs provide crucial insights for explaining LLMs, their practical adoption faces a fundamental challenge: better interpretability demands that SAEs' hidden layers have high dimensionality to satisfy sparsity constraints, resulting in prohibitive training and inference costs. Recent Mixture of Experts (MoE) approaches attempt to address this by partitioning SAEs into narrower expert networks with gated activation, thereby reducing computation. In a well-designed MoE, each expert should focus on learning a distinct set of features. However, we identify a *critical limitation* in MoE-SAE: experts often fail to specialize, frequently learning overlapping or even identical features. To address this, we propose two key innovations: (1) Multiple Expert Activation, which simultaneously engages semantically weighted expert subsets to encourage specialization, and (2) Feature Scaling, which enhances diversity through adaptive high-frequency scaling. Experiments demonstrate 24% lower reconstruction error and reduced feature redundancy compared to existing MoE-SAE methods. This work bridges the interpretability-efficiency gap in LLM analysis, allowing transparent model inspection without compromising computational feasibility. Our code is publicly available at https://anonymous.4open.science/r/scale_sae-C6D0/.
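The two innovations in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the router, the per-expert SAE weights, the `TOP_K` co-activation count, the `feat_freq` frequency statistics, and the `alpha` scaling rule below are all hypothetical stand-ins chosen only to show the shape of a multi-expert forward pass with frequency-based feature scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

D, E, F = 16, 4, 32   # model dim, number of experts, features per expert
TOP_K = 2             # hypothetical: co-activate 2 experts per token

# Hypothetical parameters: a linear router plus one small SAE per expert.
W_router = rng.normal(size=(D, E))
W_enc = rng.normal(size=(E, D, F)) / np.sqrt(D)
W_dec = rng.normal(size=(E, F, D)) / np.sqrt(F)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_sae_forward(x, feat_freq, alpha=0.5):
    """Reconstruct x by co-activating TOP_K experts (Multiple Expert
    Activation) and up-scaling frequently firing features (Feature
    Scaling). The scaling rule (1 + alpha * frequency) is an assumption
    for illustration, not the paper's exact formula."""
    scores = x @ W_router                      # router logits, shape (E,)
    top = np.argsort(scores)[-TOP_K:]          # indices of top-k experts
    weights = softmax(scores[top])             # weights over chosen experts
    x_hat = np.zeros_like(x)
    for w, e in zip(weights, top):
        f = np.maximum(x @ W_enc[e], 0.0)      # ReLU feature activations
        f = f * (1.0 + alpha * feat_freq[e])   # boost high-frequency features
        x_hat += w * (f @ W_dec[e])            # weighted expert reconstruction
    return x_hat

x = rng.normal(size=D)
freq = rng.uniform(size=(E, F))                # stand-in activation frequencies
x_hat = moe_sae_forward(x, freq)
print(x_hat.shape)                             # (16,)
```

Because several experts contribute to each reconstruction, gradients reach more than one expert per token, which is the mechanism the abstract credits with discouraging experts from converging on identical features.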
Primary Area: interpretability and explainable AI
Submission Number: 5226