Keywords: Mixture-of-Experts (MoE), Interpretability, Sparse Autoencoders (SAE), Monosemanticity, Knowledge Preservation, Selective Fine-tuning
TL;DR: We use the sparsity of MoE models to identify key experts via interpretability analysis, then fine-tune only them. This achieves strong task performance while maintaining other capabilities.
Abstract: Large language models (LLMs) with Mixture-of-Experts (MoE) architectures have emerged as a promising approach for enhancing scalability and efficiency, with minimal performance degradation across diverse downstream tasks. However, the interpretability of experts and efficient post-training methods for domain-specific experts remain understudied. In this paper, we first analyze the expert-level monosemanticity of MoE LLMs using sparse autoencoders (SAEs), thereby facilitating a deeper understanding of domain experts' roles. Additionally, leveraging the enhanced monosemanticity induced by the sparse activations of MoE LLMs, we propose a new fine-tuning strategy that freezes domain-agnostic experts in specific layers. Unlike in dense LLMs, the sparsity of MoE models encourages experts to exhibit stronger expert-level monosemantic behavior, allowing us to identify the experts responsible for particular downstream tasks and freeze the unrelated ones during post-training. By updating only domain-relevant experts, our method mitigates the risk of catastrophic forgetting in other domains and reduces computational costs. Empirically, we apply this strategy to supervised fine-tuning of MoE models on tool-use data. Results show that monosemanticity-guided tuning achieves performance comparable to fully-tuned models on tool-use tasks, while preserving better performance in other domains. Our study provides an interpretability-guided strategy for understanding and fine-tuning MoE LLMs while alleviating performance degradation across domains.
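To make the selective fine-tuning idea concrete, here is a minimal sketch of freezing domain-agnostic experts in a PyTorch-style MoE. The `ToyMoELayer` module, the `freeze_domain_agnostic_experts` helper, and the example expert indices are illustrative assumptions, not the paper's implementation; in practice the relevant-expert sets would come from the SAE-based monosemanticity analysis described above.

```python
# Sketch: keep only domain-relevant experts trainable; freeze the rest.
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """A toy MoE layer with a router and a list of expert MLPs (illustrative)."""

    def __init__(self, d_model: int = 32, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )


def freeze_domain_agnostic_experts(layers, relevant):
    """Freeze every expert not listed as domain-relevant.

    `relevant` maps layer index -> set of expert indices identified as
    task-relevant (e.g., via SAE-guided monosemanticity analysis).
    Routers and relevant experts remain trainable.
    """
    for li, layer in enumerate(layers):
        keep = relevant.get(li, set())
        for ei, expert in enumerate(layer.experts):
            trainable = ei in keep
            for p in expert.parameters():
                p.requires_grad = trainable


# Usage: suppose the analysis flagged experts {1, 5} in layer 0 and {2}
# in layer 1 as tool-use-relevant; all other experts are frozen.
layers = nn.ModuleList(ToyMoELayer() for _ in range(2))
freeze_domain_agnostic_experts(layers, {0: {1, 5}, 1: {2}})

n_trainable = sum(p.requires_grad for p in layers.parameters())
print(f"{n_trainable} trainable parameter tensors remain")
```

After freezing, standard supervised fine-tuning on the downstream (e.g., tool-use) data updates only the retained experts and routers, which is what limits interference with other domains.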
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8889