Superposition in Mixture of Experts

Published: 30 Sept 2025 · Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Poster · CC BY 4.0
Keywords: Understanding high-level properties of models, Foundational work
Other Keywords: Mixture of Experts, Superposition, Mechanistic Interpretability
TL;DR: Mixture of Experts models exhibit less superposition than dense models while maintaining comparable loss, suggesting that network sparsity may offer a path toward more interpretable AI systems.
Abstract: Superposition allows neural networks to represent far more features than they have dimensions. Previous work has explored how superposition is affected by attributes of the data. Mixture of Experts (MoE) models are used in state-of-the-art large language models and provide a network parameter that affects superposition: network sparsity. We investigate how network sparsity (the ratio of active to total experts) in MoEs affects superposition and feature representation. We extend Elhage et al. [2022]’s toy model framework to MoEs and develop new metrics to understand superposition across experts. Our findings demonstrate that MoEs consistently exhibit greater monosemanticity than their dense counterparts. Unlike dense models that show discrete phase transitions, MoEs exhibit continuous phase transitions as network sparsity increases. We define expert specialization through monosemantic feature representation rather than load balancing, showing that experts naturally organize around coherent feature combinations and maintain specialization when initialized appropriately. Our results suggest that network sparsity in MoEs may enable more interpretable models without sacrificing performance, challenging the view that interpretability and capability are fundamentally at odds.
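A minimal sketch of the kind of setup the abstract describes: the Elhage et al. [2022] toy model of superposition (sparse features compressed into a lower-dimensional hidden space and reconstructed through a ReLU) extended with top-k expert routing, where "network sparsity" is the ratio of active to total experts. This is not the authors' code; the expert count, routing scheme, feature sparsity, and uniform feature importance are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    """Toy superposition model with per-expert down/up projections (assumed structure)."""
    def __init__(self, n_features=20, d_hidden=5, n_experts=8, k_active=2):
        super().__init__()
        self.k_active = k_active  # network sparsity = k_active / n_experts
        # One W per expert, used for both compression and reconstruction (W, W^T),
        # mirroring the dense toy model's tied weights.
        self.W = nn.Parameter(torch.randn(n_experts, n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_experts, n_features))
        self.router = nn.Linear(n_features, n_experts, bias=False)

    def forward(self, x):
        # Route each input to its top-k experts and renormalize the gate weights.
        gate = F.softmax(self.router(x), dim=-1)              # (batch, n_experts)
        topk_vals, topk_idx = gate.topk(self.k_active, dim=-1)
        topk_vals = topk_vals / topk_vals.sum(-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.k_active):
            idx = topk_idx[:, slot]                           # chosen expert per example
            W_e, b_e = self.W[idx], self.b[idx]               # (batch, n_features, d_hidden)
            h = torch.einsum("bf,bfh->bh", x, W_e)            # compress to d_hidden dims
            recon = F.relu(torch.einsum("bh,bfh->bf", h, W_e) + b_e)
            out = out + topk_vals[:, slot, None] * recon
        return out

def sample_batch(batch=1024, n_features=20, feature_prob=0.05):
    """Synthetic sparse-feature data, as in the original toy-model setting."""
    x = torch.rand(batch, n_features)
    mask = torch.rand(batch, n_features) < feature_prob
    return x * mask

model = ToyMoE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(2000):
    x = sample_batch()
    loss = F.mse_loss(model(x), x)   # reconstruction loss on sparse features
    opt.zero_grad(); loss.backward(); opt.step()
```

Varying `k_active` relative to `n_experts` in a sketch like this is one way to probe how network sparsity changes how many features each expert represents, and hence the degree of superposition per expert.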
Submission Number: 284