Towards Global Expert-Level Mixed-Precision Quantization for Mixture-of-Experts LLMs

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Mixture-of-Experts, LLMs, Quantization
Abstract: Mixture-of-Experts large language models (MoE-LLMs) achieve state-of-the-art performance across diverse language tasks but incur substantial memory overhead due to their massive expert parameters. Mixed-precision quantization, which allocates different bit-widths to experts according to their importance, has emerged as a promising technique for reducing the memory consumption of MoE-LLMs. However, we identify two key limitations in existing MoE-LLM quantization methods: (1) expert importance is estimated only locally within each MoE layer, failing to capture global importance across the model and leading to suboptimal bit-width allocation; and (2) expert quantization substantially alters the dynamics of MoE routers, yet this effect is often overlooked, resulting in suboptimal routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations and enable extreme low-bit quantization. First, we introduce a global expert bit-width allocation method that formulates a linear programming model based on quantization error analysis to capture global expert importance. Second, we propose an efficient global router fine-tuning approach that adapts routers to quantized experts, enabling optimal routing. Additionally, we integrate the two techniques into a progressive quantization framework that leverages the previously quantized and fine-tuned model for expert importance estimation, enabling more accurate allocation and improved performance. Extensive experiments show that our approach substantially reduces memory usage and improves inference speed while incurring minimal performance degradation.
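To make the global bit-width allocation idea concrete, below is a minimal, hypothetical sketch of how expert-level bit assignment under a global memory budget could be posed as an integer linear program and solved with `scipy.optimize.milp`. The bit-width candidates, the per-expert error estimates, and the 3-bit average budget are all illustrative placeholders, not the formulation used in the paper.

```python
# Hypothetical sketch: pick one bit-width per expert by minimizing total
# estimated quantization error subject to a global memory budget.
# NOT the paper's actual formulation; error values are random stand-ins.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

num_experts = 8
bit_choices = [2, 3, 4]                      # candidate bit-widths per expert
expert_params = np.full(num_experts, 1e7)    # parameter count of each expert

# err[e, b]: assumed (importance-weighted) quantization error if expert e
# is quantized to bit_choices[b]; filled with random placeholder values here.
rng = np.random.default_rng(0)
err = rng.random((num_experts, len(bit_choices))) / np.array(bit_choices)

# Binary decision variable x[e, b] = 1 iff expert e is assigned bit_choices[b].
n_vars = num_experts * len(bit_choices)
c = err.ravel()                              # objective: total estimated error

# Constraint 1: exactly one bit-width per expert.
A_assign = np.zeros((num_experts, n_vars))
for e in range(num_experts):
    A_assign[e, e * len(bit_choices):(e + 1) * len(bit_choices)] = 1.0
assign = LinearConstraint(A_assign, lb=1.0, ub=1.0)

# Constraint 2: total expert memory (params * bits) stays within a budget,
# here an average of 3 bits per parameter across all experts (assumption).
mem_coeff = (expert_params[:, None] * np.array(bit_choices)[None, :]).ravel()
budget = 3.0 * expert_params.sum()
memory = LinearConstraint(mem_coeff, lb=0.0, ub=budget)

res = milp(c=c,
           constraints=[assign, memory],
           integrality=np.ones(n_vars),      # all decision variables are 0/1
           bounds=Bounds(0, 1))

x = res.x.reshape(num_experts, len(bit_choices))
allocation = [bit_choices[int(np.argmax(row))] for row in x]
print("per-expert bit-widths:", allocation)
```

In this toy setup the solver trades higher-error low-bit assignments against the memory constraint, which is the general shape of a global (cross-layer) allocation problem; how expert importance and quantization error are actually estimated is specific to GEMQ and not reproduced here.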
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7529