Retraining-free Merging of Sparse MoE via Hierarchical Clustering

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Our method, HC-SMoE, efficiently merges the experts of large Sparse Mixture-of-Experts (SMoE) models without retraining in a task-agnostic setting.
Abstract: Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models is constrained by the extensive memory requirements of their expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs, which makes merging robust to routing decisions. The proposed output-based clustering effectively captures functional relationships between experts in large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE’s effectiveness on state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE’s superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE.
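To make the output-based clustering idea concrete, here is a minimal sketch of grouping experts by their outputs on a calibration batch and merging each group. The helper names (`experts`, `calib_batch`, `cluster_and_merge`), the average linkage, and the uniform parameter averaging are illustrative assumptions, not the paper's exact procedure; see the linked repository for the actual implementation.

```python
# Hedged sketch: hierarchical clustering of SMoE experts by their outputs,
# followed by a simple within-cluster parameter average.
import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist


@torch.no_grad()
def cluster_and_merge(experts, calib_batch, num_clusters):
    """experts: list of same-architecture expert modules (hidden -> hidden).
    calib_batch: [N, d] hidden states from a small calibration set.
    num_clusters: number of experts to keep after merging."""
    # 1) Represent each expert by its average output on the calibration batch.
    reps = torch.stack([e(calib_batch).mean(dim=0) for e in experts])  # [E, d]

    # 2) Hierarchical (agglomerative) clustering on the output representations.
    dists = pdist(reps.float().cpu().numpy(), metric="euclidean")
    labels = fcluster(linkage(dists, method="average"),
                      t=num_clusters, criterion="maxclust")

    # 3) Merge each cluster by uniformly averaging expert parameters
    #    (a placeholder for whatever merging rule one prefers).
    merged = []
    for c in sorted(set(labels)):
        members = [experts[i] for i, l in enumerate(labels) if l == c]
        base = members[0]
        avg_state = {
            k: torch.stack([m.state_dict()[k] for m in members]).mean(dim=0)
            for k in base.state_dict()
        }
        base.load_state_dict(avg_state)
        merged.append(base)
    return merged, labels
```

In this sketch, routing plays no role in how experts are grouped: only their outputs on shared inputs matter, which mirrors the abstract's claim that output-based clustering is robust to routing decisions.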
Lay Summary: Modern language technologies rely on very large systems that can generate human-like text. To make these systems faster and more efficient, researchers often divide them into smaller expert components, where only a few are used at a time. This design saves computation, but the storage requirements for all the expert components remain high, which limits deployment in memory-constrained environments. This research introduces a method to reduce the number of expert components without rebuilding the system from scratch. The key idea is to merge similar experts by analyzing how they behave when given the same input. To do this, we use a process called hierarchical clustering, which progressively groups the experts whose responses are most alike. We demonstrate that this approach maintains strong performance across a wide range of language tasks while significantly reducing memory usage. This makes large-scale language technologies more accessible and easier to deploy in real-world applications.
Link To Code: https://github.com/wazenmai/HC-SMoE
Primary Area: Deep Learning->Large Language Models
Keywords: Sparse Mixture-of-Experts, Merging, Compression
Submission Number: 8395