Navigating the Design Space of MoE LLM Inference Optimization

ACL ARR 2025 May Submission 4725 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful design for large language models, offering state-of-the-art performance through computational sparsity. Despite this improved efficiency, their high memory demands hinder deployment on commonly available single-GPU systems. Existing approaches to mitigating this issue, including pruning, distillation, and quantization, often sacrifice model quality or increase inference latency. Recent MoE-specific strategies introduce dynamic expert offloading to DRAM, significantly reducing GPU memory usage without degrading performance. We evaluate and compare leading MoE optimization techniques, analyzing their memory, latency, and quality trade-offs. Building on these insights, we propose an automated MoE serving system that adaptively selects optimal configurations to meet diverse deployment constraints, enabling efficient, high-quality LLM inference on limited hardware resources.
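To make the offloading idea concrete, below is a minimal, illustrative sketch (not the submission's actual system) of dynamic expert offloading: every expert FFN lives in CPU DRAM, and an LRU cache pages at most a small budget of experts onto the GPU on demand, evicting the least recently used one when the budget is exceeded. All names (`OffloadedExpertPool`, `gpu_budget`) and sizes are hypothetical.

```python
# Illustrative sketch of dynamic expert offloading, not the paper's system.
from collections import OrderedDict
import torch
import torch.nn as nn


class OffloadedExpertPool:
    """Keeps all MoE experts in CPU (DRAM) memory and at most `gpu_budget`
    of them resident on the GPU, paging experts in on demand (LRU eviction)."""

    def __init__(self, num_experts=8, d_model=512, d_ff=2048, gpu_budget=2,
                 device="cuda" if torch.cuda.is_available() else "cpu"):
        self.device = device
        self.gpu_budget = gpu_budget
        # All experts are allocated on the CPU by default.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self._resident = OrderedDict()  # expert_id -> True, in LRU order

    def _fetch(self, expert_id: int) -> nn.Module:
        if expert_id in self._resident:
            self._resident.move_to_end(expert_id)       # cache hit
        else:
            if len(self._resident) >= self.gpu_budget:   # evict LRU expert
                victim, _ = self._resident.popitem(last=False)
                self.experts[victim].to("cpu")
            self.experts[expert_id].to(self.device)      # page in from DRAM
            self._resident[expert_id] = True
        return self.experts[expert_id]

    def forward_expert(self, expert_id: int, x: torch.Tensor) -> torch.Tensor:
        expert = self._fetch(expert_id)
        return expert(x.to(self.device))


if __name__ == "__main__":
    pool = OffloadedExpertPool()
    x = torch.randn(4, 512)
    # Router decisions are faked here; a real MoE router would produce them.
    for expert_id in [0, 3, 0, 5]:
        y = pool.forward_expert(expert_id, x)
    print(y.shape)  # torch.Size([4, 512])
```

In a real serving system, knobs such as the number of GPU-resident experts, whether offloaded weights are quantized, and whether experts are prefetched based on router statistics are the kinds of configuration choices an automated selector, like the one proposed in the abstract, could tune against memory, latency, and quality constraints.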
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Surveys
Languages Studied: English
Submission Number: 4725