RouterInterp: Understanding Superposed Specialisation in MoE Routing

Published: 02 Mar 2026 · Last Modified: 03 Apr 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: interpretability, mixture of experts, superposition, mechanistic interpretability, sparse autoencoders
TL;DR: We introduce a method for producing natural language explanations for expert routing in MoE models.
Abstract: Sparse Mixture of Experts (MoE) models scale more efficiently than dense models by routing tokens to modular expert networks that are only active when relevant to the task. A leading hypothesis for the performance of MoE models is that each expert specialises in a single, coherent domain. However, interpretability efforts that assume this hypothesis have generally been unsuccessful. We propose and present evidence for an alternative account that we call the *Superposed Specialisation Hypothesis* (SSH): experts specialise in a disjoint union of fine-grained features rather than one broad domain. Leveraging the SSH, we introduce *RouterInterp*, a method for interpreting expert routing that identifies Sparse Autoencoder features most predictive of routing decisions and produces unified natural language explanations. On gpt-oss-20b, explanations from RouterInterp predict expert routing with 77% higher accuracy than prior methods. This work provides a scalable method for generating concise and more accurate explanations of expert routing and increases our understanding of a previously uninterpretable component of foundation models.
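The core step the abstract describes, identifying which Sparse Autoencoder features are most predictive of a given expert's routing decisions, can be sketched as follows. This is a minimal illustration under assumptions, not the authors' RouterInterp implementation: it ranks features by the absolute correlation between each feature's activation and a binary routed-to-expert indicator, and all array names are hypothetical.

```python
import numpy as np

def top_predictive_features(feature_acts, expert_ids, expert, k=5):
    """Rank SAE features by |correlation| with routing to `expert`.

    feature_acts: (n_tokens, n_features) SAE feature activations per token
    expert_ids:   (n_tokens,) index of the expert each token was routed to
    Returns (top-k feature indices, per-feature correlations).
    """
    # Binary indicator: was this token routed to the expert of interest?
    y = (expert_ids == expert).astype(float)
    # Center both sides, then compute Pearson correlation per feature.
    Xc = feature_acts - feature_acts.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-9
    corr = (Xc.T @ yc) / denom
    # Most predictive features = largest absolute correlation.
    return np.argsort(-np.abs(corr))[:k], corr

# Toy demo: make feature 3 the one that drives routing to expert 1.
rng = np.random.default_rng(0)
acts = rng.random((1000, 16))
experts = (acts[:, 3] > 0.5).astype(int)
top, corr = top_predictive_features(acts, experts, expert=1, k=3)
print(top[0])  # feature 3 ranks first
```

In a real pipeline, the top-ranked features' natural language descriptions would then be summarised into a single routing explanation; correlation could equally be replaced by a sparse logistic probe or mutual information.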
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 36