Keywords: Mixture-of-Experts (MoE), Monosemanticity Analysis Framework, Interpretability, Sparse Autoencoders (SAE), Knowledge Preservation
Abstract: Mixture-of-Experts (MoE) architectures enhance the scalability and efficiency of large language models (LLMs) by activating only a subset of parameters. However, the interpretability of individual experts and corresponding strategies for post-training adaptation to domain-specific tasks remain underexplored. In this work, we first develop an interpretability framework for expert-level monosemanticity in MoE models using sparse autoencoders, offering new insights into the specialization patterns of domain experts. Building on this, we propose an expert-frozen fine-tuning method that selectively updates domain-specific experts while keeping domain-agnostic experts fixed. We demonstrate that the inherent sparsity of MoE models encourages stronger monosemantic behavior at the expert level, which allows experts responsible for particular downstream tasks to be identified and helps preserve cross-domain performance. The proposed strategy further alleviates catastrophic forgetting and reduces computational overhead by limiting updates to domain-relevant experts. Experiments on specialized domains, including medical and legal corpora, show that our approach performs on par with or better than fully fine-tuned models on in-domain tasks, while achieving relative improvements of 21.19% and 60.58% in performance retention on out-of-domain benchmarks. Compared to parameter-efficient fine-tuning baselines such as LoRA, our method achieves superior performance on the target domain and yields improvements of 11.29% and 52.70% in other domains.
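The expert-frozen fine-tuning idea described in the abstract can be pictured with the minimal sketch below. It assumes a PyTorch-style MoE whose layers expose an `experts` module list, and a hypothetical `domain_expert_ids` mapping (layer index to the set of expert indices deemed domain-relevant, e.g. by the SAE-based monosemanticity analysis); it is an illustration under those assumptions, not the authors' implementation.

```python
import torch.nn as nn


def freeze_domain_agnostic_experts(model: nn.Module, domain_expert_ids: dict) -> nn.Module:
    """Keep gradients only for experts judged relevant to the target domain.

    `model.layers[i].mlp.experts` and `domain_expert_ids` are hypothetical
    placeholders for whatever layout and expert-selection procedure is used.
    """
    for layer_idx, layer in enumerate(model.layers):
        for expert_idx, expert in enumerate(layer.mlp.experts):
            trainable = expert_idx in domain_expert_ids.get(layer_idx, set())
            for param in expert.parameters():
                param.requires_grad = trainable
    # Router and shared (non-expert) parameters are left untouched here;
    # whether to update them is a separate design choice.
    return model
```

After this call, a standard optimizer constructed over `filter(lambda p: p.requires_grad, model.parameters())` updates only the selected experts, so domain-agnostic experts remain fixed during fine-tuning.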
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Explainability of NLP Models, Interpretability and Analysis of MoE LLMs
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9547