Keywords: LLMs, MoE, Model Compression, Mixed-Precision Quantization, Combinatorial Optimization
Abstract: Quantization is a critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameter counts. However, MoE models suffer non-negligible accuracy loss under extreme quantization, e.g., below 4 bits. To address this, we introduce BT-MoE, a novel framework that achieves a unified and globally optimal allocation of mixed-precision bit-widths and low-rank compensator configurations. Our key insight is to formalize this co-design problem as a Multiple-Choice Knapsack Problem (MCKP). To make this NP-hard problem computationally feasible, we further propose an efficient proxy metric based on layer-wise quantization loss for rapidly assessing the impact of each candidate configuration, so that a standard Integer Linear Programming (ILP) solver can solve the MCKP in practical time. Our comprehensive evaluation demonstrates that BT-MoE consistently outperforms state-of-the-art quantization methods across various MoE models and benchmarks. By systematically exploring the design space, BT-MoE achieves superior accuracy-memory trade-offs, significantly improving the deployability of large MoE models on resource-constrained hardware.
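As the abstract describes, BT-MoE casts the joint per-layer choice of bit-width and low-rank compensator as an MCKP solved with an off-the-shelf ILP solver. The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of how such an MCKP can be posed as an ILP using PuLP; the layer count, candidate configurations, proxy losses, memory costs, and budget are all made-up placeholders.

```python
import pulp

# Hypothetical per-layer candidates: (label, proxy quantization loss, memory cost in MB).
# Lower bit-widths cost less memory but incur a larger proxy loss; a low-rank
# compensator (rank r > 0) adds a little memory and reduces loss. Numbers are illustrative.
candidates = [
    ("w2_r0",  1.00, 10.0),
    ("w2_r16", 0.60, 12.0),
    ("w3_r0",  0.40, 15.0),
    ("w4_r0",  0.15, 20.0),
]
num_layers = 8
memory_budget = 8 * 14.0  # hypothetical global memory budget (MB)

prob = pulp.LpProblem("mckp_bitwidth_allocation", pulp.LpMinimize)

# Binary selector x[l][i] = 1 iff layer l uses candidate configuration i.
x = [[pulp.LpVariable(f"x_{l}_{i}", cat="Binary") for i in range(len(candidates))]
     for l in range(num_layers)]

# Objective: minimize the summed layer-wise proxy quantization loss.
prob += pulp.lpSum(candidates[i][1] * x[l][i]
                   for l in range(num_layers) for i in range(len(candidates)))

# Multiple-choice constraint: exactly one configuration per layer.
for l in range(num_layers):
    prob += pulp.lpSum(x[l]) == 1

# Knapsack constraint: total memory must stay within the budget.
prob += pulp.lpSum(candidates[i][2] * x[l][i]
                   for l in range(num_layers) for i in range(len(candidates))) <= memory_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for l in range(num_layers):
    choice = next(candidates[i][0] for i in range(len(candidates)) if x[l][i].value() == 1)
    print(f"layer {l}: {choice}")
```

At this toy scale the solver returns instantly; the point of the proxy metric in the paper is that the per-configuration losses feeding such an ILP can be estimated cheaply enough for the formulation to remain tractable at real model scale.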
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8830