Keywords: Mixture-of-Experts (MoE), Monosemanticity Analysis Framework, Interpretability, Sparse Autoencoders (SAE), Knowledge Preservation
Abstract: Mixture-of-Experts (MoE) architectures enhance the scalability and efficiency of large language models (LLMs) by activating only a subset of parameters. However, the interpretability of individual experts and corresponding strategies for post-training adaptation to domain-specific tasks remain underexplored. In this work, we first develop an interpretability framework for expert-level monosemanticity in MoE models using sparse autoencoders, offering new insights into the specialization patterns of domain experts. Building on this, we propose an expert-frozen fine-tuning method that selectively updates domain-specific experts while keeping domain-agnostic experts fixed. We demonstrate that the inherent sparsity of MoE models encourages stronger monosemantic behavior at the expert level, which allows experts responsible for particular downstream tasks to be identified and helps preserve cross-domain performance. The proposed strategy further alleviates catastrophic forgetting and reduces computational overhead by limiting updates to domain-relevant experts. Experiments on specialized domains, including medical and legal corpora, show that our approach performs on par with or better than fully fine-tuned models on in-domain tasks, while achieving relative improvements of 21.19% and 60.58% in performance retention on out-of-domain benchmarks. Compared to parameter-efficient fine-tuning baselines such as LoRA, our method achieves superior performance on the target domain and yields improvements of 11.29% and 52.70% in other domains.
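The expert-frozen fine-tuning idea described in the abstract can be pictured with the minimal sketch below. It assumes a PyTorch-style MoE whose layers expose an `experts` module list, and a hypothetical `domain_expert_ids` mapping (layer index to the set of expert indices deemed domain-relevant, e.g. by the SAE-based monosemanticity analysis); it is an illustration under those assumptions, not the authors' implementation.

```python
import torch.nn as nn


def freeze_domain_agnostic_experts(model: nn.Module, domain_expert_ids: dict) -> nn.Module:
    """Keep gradients only for experts judged relevant to the target domain.

    `model.layers[i].mlp.experts` and `domain_expert_ids` are hypothetical
    placeholders for whatever layout and expert-selection procedure is used.
    """
    for layer_idx, layer in enumerate(model.layers):
        for expert_idx, expert in enumerate(layer.mlp.experts):
            trainable = expert_idx in domain_expert_ids.get(layer_idx, set())
            for param in expert.parameters():
                param.requires_grad = trainable
    # Router and shared (non-expert) parameters are left untouched here;
    # whether to update them is a separate design choice.
    return model
```

After this call, a standard optimizer constructed over `filter(lambda p: p.requires_grad, model.parameters())` updates only the selected experts, so domain-agnostic experts remain fixed during fine-tuning.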
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Explainability of NLP Models, Interpretability and Analysis of MoE LLMs
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9547