Keywords: quantization
Abstract: Mixture of Experts (MoE) enables efficient parameter scaling in large language models by dynamically activating relevant parameter subsets per input token.
Compressing MoE models presents unique challenges due to their inherent sparsity. Traditional quantization techniques, which are typically effective for dense models, prove inadequate when applied to MoE architectures.
This paper presents an efficient MoE quantization algorithm: a fine-grained, adaptive quantization approach coupled with an efficient method for determining optimal configurations.
Specifically, we construct a mixed-precision quantization search space encompassing different granularities from expert-level to channel-level.
This approach facilitates precise bit-width resource allocation across model components based on their significance and activation frequency.
We then leverage an evolutionary algorithm to efficiently navigate this search space, autonomously identifying optimal quantization configurations.
The synergy between adaptive granularity and automated search effectively mitigates the distinctive quantization challenges inherent to MoE models, culminating in a fully automated framework for efficient MoE quantization.
Experimental results indicate that our method achieves significant performance improvements across multiple evaluation tasks, with particularly notable results in low-bit quantization scenarios.
When applied to the Mixtral-8x7B-v0.1 model, our approach outperforms the current state-of-the-art by $9.24\%$, setting a new benchmark in MoE quantization. Code is available in the supplementary materials.
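To illustrate the idea described in the abstract, the sketch below shows a toy version of an evolutionary search over a mixed-precision quantization space with expert-level granularity, where a proxy objective penalizes low bit-widths on frequently activated experts. This is a minimal, hypothetical illustration only: the names (`random_config`, `proxy_fitness`, the bit-width choices, and the toy objective) are assumptions, not the paper's actual search space, fitness measure, or evolutionary operators, which are provided in the supplementary code.

```python
# Illustrative sketch of evolutionary mixed-precision search (expert-level granularity).
# All names and the proxy objective are hypothetical; a real implementation would
# evaluate quantized-model loss on calibration data instead.
import random

NUM_EXPERTS = 8          # e.g., Mixtral-8x7B has 8 experts per MoE layer
NUM_LAYERS = 4           # small toy setting for the sketch
BIT_CHOICES = (2, 3, 4)  # candidate bit-widths in the mixed-precision space

def random_config():
    """One candidate: a bit-width per (layer, expert) pair."""
    return [[random.choice(BIT_CHOICES) for _ in range(NUM_EXPERTS)]
            for _ in range(NUM_LAYERS)]

def mutate(cfg, rate=0.1):
    """Randomly reassign a few bit-widths."""
    return [[random.choice(BIT_CHOICES) if random.random() < rate else b
             for b in layer] for layer in cfg]

def crossover(a, b):
    """Layer-wise uniform crossover of two parent configurations."""
    return [la if random.random() < 0.5 else lb for la, lb in zip(a, b)]

def proxy_fitness(cfg, activation_freq):
    """Stand-in objective: reward a low average bit budget while penalizing
    low precision on frequently activated experts."""
    cost = sum(sum(layer) for layer in cfg) / (NUM_LAYERS * NUM_EXPERTS)
    penalty = sum(f * (max(BIT_CHOICES) - b)
                  for layer, freqs in zip(cfg, activation_freq)
                  for b, f in zip(layer, freqs))
    return -(cost + 0.1 * penalty)

def evolve(activation_freq, pop_size=16, generations=20):
    pop = [random_config() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: proxy_fitness(c, activation_freq), reverse=True)
        parents = pop[: pop_size // 2]
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=lambda c: proxy_fitness(c, activation_freq))

if __name__ == "__main__":
    # Hypothetical per-expert activation frequencies (uniform here for simplicity).
    freqs = [[1.0 / NUM_EXPERTS] * NUM_EXPERTS for _ in range(NUM_LAYERS)]
    best = evolve(freqs)
    print("best per-expert bit-widths:", best)
```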
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 314