Drop or Merge? Hybrid MoE LLM Compressors via Metric-Driven Adaptive Allocation

Authors: ICLR 2026 Conference Submission 698 Authors

02 Sept 2025 (modified: 23 Dec 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture-of-Experts, MoE compression, Expert dropping, Expert merging, Adaptive layer-wise allocation, Language model efficiency
TL;DR: We present a hybrid MoE compression framework that first drops unimportant experts, then merges remaining experts through metric-driven adaptive allocation, achieving superior performance-efficiency trade-offs.
Abstract: Mixture-of-Experts (MoE) models enhance the scalability of large language models but encounter deployment challenges due to their vast parameter counts. Existing compression methods either drop experts entirely (discarding valuable knowledge) or merge experts (suffering from parameter conflicts), and they typically employ uniform strategies that ignore the heterogeneous specialization patterns across layers. In this paper, we propose DM-MoE, an adaptive Drop-then-Merge MoE compression framework that addresses these limitations. Our approach is motivated by two key observations: first, that eliminating a small number of truly redundant experts facilitates more effective subsequent merging, and second, that expert functional redundancy and behavioral similarity serve as reliable indicators for adaptive compression throughout MoE architectures. Building on these insights, we develop a two-stage compression pipeline: (1) In the dropping phase, we quantify layer redundancy via mutual information between expert outputs and formulate a constrained optimization problem to derive layer-wise dropping budgets, then select experts based on output impact assessment to retain those with high functional significance. (2) In the merging phase, we adaptively determine the number of expert groups per layer using behavioral diversity metrics, partition experts into functionally similar clusters via graph-based optimization, and merge them using importance-weighted averaging based on activation frequency and output deviation. Comprehensive evaluations on Mixtral, Qwen, DeepSeek, and GPT-OSS MoE models demonstrate that DM-MoE surpasses state-of-the-art methods across models and compression ratios. For Mixtral-8×7B, we retain 96.5%/89.1% of the original performance at 25%/50% expert reduction. Code is available in the Appendix.
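The abstract outlines a two-stage drop-then-merge pipeline for a single MoE layer. The sketch below illustrates only the overall control flow in plain NumPy; every concrete choice in it is an illustrative assumption rather than the paper's method: experts are represented as flat weight vectors, importance is proxied by activation frequency times output deviation, grouping uses cosine similarity with greedy farthest-point seeding, and the layer-wise budgets (`n_drop`, `n_groups`) are taken as given instead of being derived from the mutual-information and constrained-optimization formulations described above.

```python
# Minimal sketch of a drop-then-merge expert compressor for one MoE layer.
# Illustrative only: importance proxy, similarity metric, and clustering are
# simplified placeholders, not the paper's metric-driven formulations.
import numpy as np

def drop_then_merge(experts, act_freq, out_dev, n_drop, n_groups):
    """experts: (E, D) array of per-expert weight vectors.
    act_freq, out_dev: (E,) importance signals (hypothetical proxies).
    n_drop: experts removed in this layer; n_groups: clusters to merge into."""
    importance = act_freq * out_dev                   # proxy for output impact
    keep = np.argsort(importance)[n_drop:]            # stage 1: drop least important
    W, imp = experts[keep], importance[keep]

    # Stage 2: group the remaining experts by cosine similarity of their weights.
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sim = Wn @ Wn.T
    # Greedy farthest-point seeding, then assign each expert to its nearest seed.
    seeds = [int(np.argmax(imp))]
    while len(seeds) < n_groups:
        seeds.append(int(np.argmin(sim[:, seeds].max(axis=1))))
    labels = np.argmax(sim[:, seeds], axis=1)

    merged = []
    for g in range(n_groups):
        members = np.where(labels == g)[0]
        w = imp[members] / imp[members].sum()          # importance-weighted average
        merged.append((w[:, None] * W[members]).sum(axis=0))
    return np.stack(merged)

# Toy usage: 8 experts of dimension 16, drop 2, merge the remaining 6 into 3 groups.
rng = np.random.default_rng(0)
compressed = drop_then_merge(rng.normal(size=(8, 16)),
                             rng.random(8), rng.random(8), n_drop=2, n_groups=3)
print(compressed.shape)  # (3, 16)
```

In this toy setup the layer shrinks from 8 experts to 3 merged experts; in the framework described above the per-layer drop and group counts would instead be allocated adaptively from the redundancy and diversity metrics.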
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 698