Router Choice Matters: Rank-Aware Post-Training Quantization for MoE Models

13 Sept 2025 (modified: 03 Dec 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Mixture-of-Experts, Post-training Quantization, Large Language Models
Abstract: Quantizing Mixture-of-Experts (MoE) language models is challenging because router errors cascade into expert selection and dominate the accuracy loss. We study this effect and show that preserving the router's expert selections yields the largest gains, with most errors arising from near-neighbor rank flips around the top-$k$ boundary. Motivated by these observations, we present ExpertQuant, a training-free, calibration-only post-training quantization (PTQ) framework tailored to MoE. ExpertQuant combines (i) an Expert-Aware Scale that accommodates heterogeneous activation ranges with two router-alignment objectives between the quantized and full-precision models: (ii) a Rank-Aware Jaccard Loss, which aligns the top-$k$ expert ranking, and (iii) a Gap Hinge Loss, which preserves score margins between consecutively ranked experts to suppress rank flips. Across OLMoE, DeepSeek-MoE, and Qwen3-MoE, ExpertQuant consistently reduces perplexity on C4 and WikiText-2 and improves zero-shot accuracy under W4A4 and W4A8, with similar trends at lower bit-widths. The framework requires no retraining, integrates seamlessly with existing MoE architectures, and demonstrates that stabilizing router rankings during calibration is key to accurate low-bit MoE inference.
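
To make the two router-alignment objectives concrete, the sketch below shows one plausible way to compute a top-$k$ Jaccard penalty and a consecutive-gap hinge penalty from full-precision and quantized router logits. This is a minimal illustration, not the paper's implementation: the function names, the hard (non-differentiable) Jaccard form, and the margin value are assumptions; the hard form would suit scoring calibration candidates rather than gradient-based optimization.

```python
# Hedged sketch of the two router-alignment objectives described in the
# abstract. All names and design details here are illustrative assumptions.
import torch
import torch.nn.functional as F


def rank_aware_jaccard_loss(fp_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Penalize disagreement between the top-k expert sets of the
    full-precision (fp) and quantized (q) routers.

    fp_logits, q_logits: [batch, num_experts] router scores.
    Returns a scalar in [0, 1]; 0 when the top-k sets match exactly.
    Note: this hard-set variant is not differentiable; it is meant for
    comparing calibration candidates, not backpropagation.
    """
    fp_topk = fp_logits.topk(k, dim=-1).indices
    q_topk = q_logits.topk(k, dim=-1).indices

    # One-hot membership masks of the two top-k sets.
    fp_mask = torch.zeros_like(fp_logits).scatter_(-1, fp_topk, 1.0)
    q_mask = torch.zeros_like(q_logits).scatter_(-1, q_topk, 1.0)

    intersection = (fp_mask * q_mask).sum(-1)
    union = (fp_mask + q_mask).clamp(max=1.0).sum(-1)
    return (1.0 - intersection / union).mean()


def gap_hinge_loss(fp_logits: torch.Tensor, q_logits: torch.Tensor, k: int = 8, margin: float = 0.0) -> torch.Tensor:
    """Encourage the quantized router to keep the score gaps between
    consecutively ranked experts (in the full-precision ordering), so
    near-neighbor rank flips around the top-k boundary are suppressed.
    """
    # Order experts by the full-precision router; keep top-(k+1) so the
    # boundary between rank k and rank k+1 is also constrained.
    fp_order = fp_logits.argsort(dim=-1, descending=True)[..., : k + 1]
    q_sorted = q_logits.gather(-1, fp_order)

    # Gap between consecutively ranked experts under the quantized scores;
    # penalize gaps that shrink below the margin.
    gaps = q_sorted[..., :-1] - q_sorted[..., 1:]
    return F.relu(margin - gaps).mean()
```

In a calibration loop, one could evaluate a weighted sum of these two terms over a small calibration set for each candidate quantization scale and keep the scale that minimizes it; the weighting and candidate search are likewise assumptions here.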
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4812