Keywords: Mixture of Experts System, PTQ
TL;DR: Improving the Accuracy of Post-Training Quantization Models Using a Mixture-of-Experts System
Abstract: Quantization plays a crucial role in improving model efficiency and reducing
deployment costs, enabling the widespread application of deep learning
models on resource-constrained devices. However, the quantization process inevitably
introduces accuracy degradation. In this paper, we propose Mixture of
Quantization Experts (abbr. MoQE), a quantization inference framework based
on the Mixture-of-Experts (MoE) architecture, aiming to jointly improve the performance
of quantized models. MoQE combines multiple quantization variants
of one full-precision model as specialized "quantization experts" and dynamically
routes input data to the most suitable expert based on its characteristics. MoQE alleviates
the performance degradation commonly seen in single quantized models
through these specialized quantization experts. We design lightweight,
structure-aware router models tailored for both CV and NLP tasks. Experimental
evaluations on ResNet, LLaMA, and Qwen model families across benchmark
datasets including ImageNet, WikiText, C4, and OpenWebText demonstrate that
MoQE achieves performance comparable to SOTA quantized models without
incurring significant increases in inference latency.
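The routing idea described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the `quantize` helper, the expert bit-widths, and the linear router are all assumptions made for the example; the paper's actual routers are structure-aware models tailored to CV and NLP tasks.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(w, bits):
    # Illustrative uniform symmetric quantization of a weight matrix.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

# One full-precision linear layer and two quantized variants ("experts").
W = rng.standard_normal((8, 4))
experts = {"int8": quantize(W, 8), "int4": quantize(W, 4)}

# Hypothetical lightweight router: a tiny linear scorer over the input
# that picks one expert per input; only the chosen expert runs.
R = rng.standard_normal((8, len(experts)))

def moqe_forward(x):
    idx = int(np.argmax(x @ R))          # route on input characteristics
    name = list(experts)[idx]
    return name, x @ experts[name]       # inference with the selected expert

name, y = moqe_forward(rng.standard_normal(8))
```

Because exactly one expert executes per input and the router is tiny relative to the model, the added inference latency stays small, matching the claim in the abstract.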
Primary Area: foundation or frontier models, including LLMs
Submission Number: 8233