Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful design for large language models, offering state-of-the-art performance through computational sparsity.
Despite improved efficiency, their high memory demands hinder deployment on commonly available single-GPU systems.
Existing approaches to mitigate this issue, including pruning, distillation, and quantization, often sacrifice model quality or increase inference latency. Recent MoE-specific strategies introduce dynamic expert offloading to host DRAM, significantly reducing GPU memory usage without degrading performance.
We evaluate and compare leading MoE optimization techniques, analyzing their memory, latency, and quality trade-offs.
Building on these insights, we propose an automated MoE serving system that adaptively selects optimal configurations to meet diverse deployment constraints.
This enables efficient, high-quality LLM inference on limited hardware resources.
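To make the adaptive-selection idea concrete, the following is a minimal illustrative sketch, not the paper's actual system: a toy selector that picks the highest-quality serving configuration that fits a given GPU memory and latency budget. All configuration names, profiling numbers, and the scoring rule are hypothetical placeholders.

```python
# Toy sketch of constraint-driven configuration selection for MoE serving.
# All candidate configs and their numbers below are hypothetical.
from dataclasses import dataclass

@dataclass
class Config:
    name: str           # hypothetical optimization strategy
    gpu_mem_gb: float   # estimated GPU memory footprint
    latency_ms: float   # estimated per-token latency
    quality: float      # estimated quality retention (0-1)

CANDIDATES = [  # hypothetical profiling results
    Config("full-gpu",         48.0, 12.0, 1.00),
    Config("expert-offload",   14.0, 19.0, 1.00),
    Config("offload+int4",      9.0, 16.0, 0.97),
    Config("pruned-distilled",  7.0, 11.0, 0.92),
]

def select(mem_budget_gb: float, latency_budget_ms: float) -> Config:
    """Return the highest-quality config satisfying both deployment constraints."""
    feasible = [c for c in CANDIDATES
                if c.gpu_mem_gb <= mem_budget_gb and c.latency_ms <= latency_budget_ms]
    if not feasible:
        raise ValueError("No configuration satisfies the given constraints.")
    return max(feasible, key=lambda c: c.quality)

if __name__ == "__main__":
    # On a 16 GB GPU with a 20 ms/token budget, this picks "expert-offload".
    print(select(mem_budget_gb=16.0, latency_budget_ms=20.0).name)
```

In practice such a selector would be driven by measured profiles rather than fixed tables, but the sketch captures the intended trade-off: prefer lossless memory-reduction techniques when they fit the constraints, and fall back to lossy ones only when necessary.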
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low-compute settings-efficiency, Surveys
Languages Studied: English
Submission Number: 4725