Keywords: Mixture-of-Experts (MoE), Interpretability, Sparse Autoencoders (SAE), Monosemanticity, Knowledge Preservation, Selective Fine-tuning
TL;DR: We use the sparsity of MoE models to identify key experts via interpretability analysis, then fine-tune only them. This achieves strong task performance while maintaining other capabilities.
Abstract: Large language models (LLMs) with Mixture-of-Experts (MoE) architectures have emerged as a promising approach for enhancing scalability and efficiency, with minimal performance degradation across diverse downstream tasks. However, the interpretability of experts and efficient post-training methods for domain-specific experts remain understudied. In this paper, we first analyze the expert-level monosemanticity of MoE LLMs using sparse autoencoders (SAEs), thereby facilitating a deeper understanding of domain experts' roles. Additionally, leveraging the enhanced monosemanticity induced by the sparse activations of MoE LLMs, we propose a new fine-tuning strategy that freezes domain-agnostic experts in specific layers. Unlike in dense LLMs, the sparsity of MoE models encourages experts to exhibit stronger expert-level monosemantic behavior, allowing us to identify the experts responsible for particular downstream tasks and freeze the unrelated ones during post-training. By updating only domain-relevant experts, our method mitigates the risk of catastrophic forgetting in other domains and reduces computational costs. Empirically, we apply this strategy to supervised fine-tuning of MoE models on tool-use data. Results show that monosemanticity-guided tuning achieves performance comparable to fully-tuned models on tool-use tasks, while preserving better performance in other domains. Our study provides an interpretability-guided strategy for understanding and fine-tuning MoE LLMs while alleviating performance degradation across domains.
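To make the selective fine-tuning idea concrete, here is a minimal sketch of freezing domain-agnostic experts in a PyTorch-style MoE. The `ToyMoELayer` module, the `freeze_domain_agnostic_experts` helper, and the example expert indices are illustrative assumptions, not the paper's implementation; in practice the relevant-expert sets would come from the SAE-based monosemanticity analysis described above.

```python
# Sketch: keep only domain-relevant experts trainable; freeze the rest.
import torch.nn as nn


class ToyMoELayer(nn.Module):
    """A toy MoE layer with a router and a list of expert MLPs (illustrative)."""

    def __init__(self, d_model: int = 32, n_experts: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )


def freeze_domain_agnostic_experts(layers, relevant):
    """Freeze every expert not listed as domain-relevant.

    `relevant` maps layer index -> set of expert indices identified as
    task-relevant (e.g., via SAE-guided monosemanticity analysis).
    Routers and relevant experts remain trainable.
    """
    for li, layer in enumerate(layers):
        keep = relevant.get(li, set())
        for ei, expert in enumerate(layer.experts):
            trainable = ei in keep
            for p in expert.parameters():
                p.requires_grad = trainable


# Usage: suppose the analysis flagged experts {1, 5} in layer 0 and {2}
# in layer 1 as tool-use-relevant; all other experts are frozen.
layers = nn.ModuleList(ToyMoELayer() for _ in range(2))
freeze_domain_agnostic_experts(layers, {0: {1, 5}, 1: {2}})

n_trainable = sum(p.requires_grad for p in layers.parameters())
print(f"{n_trainable} trainable parameter tensors remain")
```

After freezing, standard supervised fine-tuning on the downstream (e.g., tool-use) data updates only the retained experts and routers, which is what limits interference with other domains.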
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8889