BiMoE: Pushing the Limit of Post-Training Quantization for MoE-based LLMs

02 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: MoE-based LLMs, Binarization
Abstract: Large language models (LLMs) with Mixture-of-Experts (MoE) architectures have achieved remarkable progress in natural language processing, yet their massive memory and compute costs hinder practical deployment. Binarization, which compresses model weights to 1 bit, offers an extreme efficiency advantage. However, existing methods that primarily target dense LLMs are not well suited to address MoE-specific quantization challenges, including redundant expert representations, task-unaware weight-importance scoring, and quantization-induced expert shift. To this end, we propose BiMoE, the first binarization framework tailored for MoE-based LLMs. BiMoE is built on three core innovations: 1) using joint SVD decomposition to reduce cross-expert redundancy; 2) integrating global loss gradients into local Hessian metrics to enhance weight-importance estimation; 3) introducing an error constraint guided by the input null space to mitigate routing distortion. Notably, BiMoE achieves these optimizations while incurring no additional storage overhead, striking a balance between efficiency and model performance. Extensive experiments demonstrate that BiMoE consistently outperforms state-of-the-art binary methods across multiple MoE-based LLMs and benchmarks. For example, on Qwen3-30B-A3B, BiMoE reduces perplexity by 52.2\%, improves average zero-shot performance by 43.4\%, achieves over 2× inference speedup, and further shortens quantization time. The code is available at https://anonymous.4open.science/r/BiMoE-CADF/.
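For context on the 1-bit compression the abstract refers to, a classic weight-binarization scheme replaces each weight with its sign times a per-channel scale chosen to minimize reconstruction error. The sketch below is illustrative only: it shows the generic sign-plus-scale baseline, not BiMoE's actual method, and the helper name `binarize_weights` is hypothetical.

```python
import numpy as np

def binarize_weights(W):
    """Generic 1-bit weight quantization: W ≈ alpha * sign(W).

    alpha is the per-output-channel mean absolute value, which minimizes
    the L2 reconstruction error ||W - alpha * B|| for B in {-1, +1}.
    Illustrative baseline only; NOT the BiMoE algorithm.
    """
    alpha = np.abs(W).mean(axis=1, keepdims=True)  # per-row scale
    B = np.sign(W)
    B[B == 0] = 1.0  # map exact zeros to +1 so every entry is strictly ±1
    return alpha * B, B, alpha

# Example: binarize a small random weight matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8))
W_hat, B, alpha = binarize_weights(W)
```

Storing `B` as packed bits plus one scale per row is what yields the roughly 16x memory reduction over FP16 weights that motivates binarization for MoE models with many experts.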
Primary Area: foundation or frontier models, including LLMs
Submission Number: 767