TL;DR: We propose MxMoE, which automatically discovers mixed-precision quantization schemes and generates efficient kernels for parallel GPU execution, improving both quantization accuracy and computational efficiency.
Abstract: Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
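As a rough illustration of the allocation problem the abstract describes (not the authors' actual optimization, which also accounts for hardware resources and generates GroupGEMM kernels), the sketch below greedily assigns bit-widths to experts from hypothetical sensitivity and activation-frequency statistics under an average bit-width budget. All names, the scoring rule, and the candidate bit-widths are assumptions made for illustration.

```python
# Minimal sketch (assumed, not MxMoE's algorithm): experts that are both
# sensitive to quantization and frequently activated are raised to higher
# precision first, subject to an average bit-width budget.

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExpertStats:
    expert_id: int
    sensitivity: float      # e.g., loss increase when this expert is quantized
    activation_freq: float  # fraction of tokens routed to this expert


def assign_bitwidths(stats: List[ExpertStats],
                     candidate_bits=(2, 3, 4, 8),
                     avg_bit_budget: float = 4.0) -> Dict[int, int]:
    """Greedy allocation: start every expert at the lowest precision, then
    promote experts in order of (sensitivity * activation frequency) as long
    as the average bit-width stays within the budget."""
    bits = {s.expert_id: min(candidate_bits) for s in stats}
    ranked = sorted(stats, key=lambda s: s.sensitivity * s.activation_freq,
                    reverse=True)
    for s in ranked:
        for b in sorted(candidate_bits):
            if b <= bits[s.expert_id]:
                continue
            trial = {**bits, s.expert_id: b}
            if sum(trial.values()) / len(trial) <= avg_bit_budget:
                bits[s.expert_id] = b
            else:
                break
    return bits


if __name__ == "__main__":
    stats = [
        ExpertStats(0, sensitivity=0.9, activation_freq=0.40),
        ExpertStats(1, sensitivity=0.2, activation_freq=0.35),
        ExpertStats(2, sensitivity=0.6, activation_freq=0.15),
        ExpertStats(3, sensitivity=0.1, activation_freq=0.10),
    ]
    print(assign_bitwidths(stats))  # -> {0: 8, 1: 2, 2: 4, 3: 2}
```

In the actual system, the resulting per-expert precisions would then be executed together by the generated mixed-precision GroupGEMM kernels, so GEMMs at different bit-widths run in parallel on the GPU.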
Lay Summary: Large AI models, such as Mixture-of-Experts (MoE) models, are highly effective but often too bulky and slow for everyday use. These models consist of multiple specialized components, or "experts," each handling a different part of a task. While this design boosts performance, it also leads to high memory usage and slower processing.
Our work introduces MxMoE, a new method that makes these large models faster and more efficient. MxMoE shrinks different parts of the model by representing them with lower-precision data types, a technique called quantization. What makes our approach unique is that it considers how often each expert is used and how sensitive each part is to quantization, allowing for smarter, more targeted reductions.
We also developed specialized software that enables different parts of the model to run simultaneously, even when they use different levels of precision. This co-design of model accuracy and computational efficiency leads to significant improvements: MxMoE can run up to 3.4 times faster than the original full-precision model and outperforms existing methods in accuracy.
By making powerful AI models more accessible and efficient, MxMoE opens the door to using advanced AI in a wider range of settings, from personal PCs to small servers. Our implementation is open-source and available online.
Link To Code: https://github.com/cat538/MxMoE
Primary Area: Deep Learning->Large Language Models
Keywords: Quantization, Mixture-of-Expert, LLMs, Efficiency
Flagged For Ethics Review: true
Submission Number: 9526