Keywords: LLMs, MoE, Model Compression, Mixed-Precision Quantization, Combinatorial Optimization
Abstract: Quantization is a critical approach for efficiently deploying Mixture-of-Experts (MoE) models with massive parameter counts. However, MoE models suffer non-negligible accuracy loss under extreme quantization, e.g., below 4 bits. To address this, we introduce BT-MoE, a novel framework that achieves a unified and globally optimal allocation of mixed-precision bit-widths and low-rank compensator configurations. Our key insight is to formalize this co-design problem as a Multiple-Choice Knapsack Problem (MCKP). To make this NP-hard problem computationally feasible, we further propose an efficient proxy metric based on layer-wise quantization loss for rapidly assessing the impact of each candidate configuration, so that a standard Integer Linear Programming (ILP) solver can solve the MCKP in practical time. Our comprehensive evaluation demonstrates that BT-MoE consistently outperforms state-of-the-art quantization methods across various MoE models and benchmarks. By systematically exploring the design space, BT-MoE achieves superior accuracy-memory trade-offs, significantly improving the deployability of large MoE models on resource-constrained hardware.
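As the abstract describes, BT-MoE casts the joint per-layer choice of bit-width and low-rank compensator as an MCKP solved with an off-the-shelf ILP solver. The sketch below is a minimal, hypothetical illustration (not the authors' implementation) of how such an MCKP can be posed as an ILP using PuLP; the layer count, candidate configurations, proxy losses, memory costs, and budget are all made-up placeholders.

```python
import pulp

# Hypothetical per-layer candidates: (label, proxy quantization loss, memory cost in MB).
# Lower bit-widths cost less memory but incur a larger proxy loss; a low-rank
# compensator (rank r > 0) adds a little memory and reduces loss. Numbers are illustrative.
candidates = [
    ("w2_r0",  1.00, 10.0),
    ("w2_r16", 0.60, 12.0),
    ("w3_r0",  0.40, 15.0),
    ("w4_r0",  0.15, 20.0),
]
num_layers = 8
memory_budget = 8 * 14.0  # hypothetical global memory budget (MB)

prob = pulp.LpProblem("mckp_bitwidth_allocation", pulp.LpMinimize)

# Binary selector x[l][i] = 1 iff layer l uses candidate configuration i.
x = [[pulp.LpVariable(f"x_{l}_{i}", cat="Binary") for i in range(len(candidates))]
     for l in range(num_layers)]

# Objective: minimize the summed layer-wise proxy quantization loss.
prob += pulp.lpSum(candidates[i][1] * x[l][i]
                   for l in range(num_layers) for i in range(len(candidates)))

# Multiple-choice constraint: exactly one configuration per layer.
for l in range(num_layers):
    prob += pulp.lpSum(x[l]) == 1

# Knapsack constraint: total memory must stay within the budget.
prob += pulp.lpSum(candidates[i][2] * x[l][i]
                   for l in range(num_layers) for i in range(len(candidates))) <= memory_budget

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for l in range(num_layers):
    choice = next(candidates[i][0] for i in range(len(candidates)) if x[l][i].value() == 1)
    print(f"layer {l}: {choice}")
```

At this toy scale the solver returns instantly; the point of the proxy metric in the paper is that the per-configuration losses feeding such an ILP can be estimated cheaply enough for the formulation to remain tractable at real model scale.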
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 8830