Abstract: Mixture-of-Experts (MoE) architectures have emerged as a powerful design for large language models, offering state-of-the-art performance through computational sparsity.
Despite improved efficiency, their high memory demands hinder deployment on commonly available single-GPU systems.
Existing approaches to mitigate this issue, including pruning, distillation, and quantization, often sacrifice model quality or increase inference latency. Recent MoE-specific strategies introduce dynamic expert offloading to host DRAM, significantly reducing GPU memory usage without degrading performance.
We evaluate and compare leading MoE optimization techniques, analyzing their memory, latency, and quality trade-offs.
Building on these insights, we propose an automated MoE serving system that adaptively selects optimal configurations to meet diverse deployment constraints.
This enables efficient, high-quality LLM inference on limited hardware resources.
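To make the adaptive-selection idea concrete, the following is a minimal illustrative sketch, not the paper's actual system: a toy selector that picks the highest-quality serving configuration that fits a given GPU memory and latency budget. All configuration names, profiling numbers, and the scoring rule are hypothetical placeholders.

```python
# Toy sketch of constraint-driven configuration selection for MoE serving.
# All candidate configs and their numbers below are hypothetical.
from dataclasses import dataclass

@dataclass
class Config:
    name: str           # hypothetical optimization strategy
    gpu_mem_gb: float   # estimated GPU memory footprint
    latency_ms: float   # estimated per-token latency
    quality: float      # estimated quality retention (0-1)

CANDIDATES = [  # hypothetical profiling results
    Config("full-gpu",         48.0, 12.0, 1.00),
    Config("expert-offload",   14.0, 19.0, 1.00),
    Config("offload+int4",      9.0, 16.0, 0.97),
    Config("pruned-distilled",  7.0, 11.0, 0.92),
]

def select(mem_budget_gb: float, latency_budget_ms: float) -> Config:
    """Return the highest-quality config satisfying both deployment constraints."""
    feasible = [c for c in CANDIDATES
                if c.gpu_mem_gb <= mem_budget_gb and c.latency_ms <= latency_budget_ms]
    if not feasible:
        raise ValueError("No configuration satisfies the given constraints.")
    return max(feasible, key=lambda c: c.quality)

if __name__ == "__main__":
    # On a 16 GB GPU with a 20 ms/token budget, this picks "expert-offload".
    print(select(mem_budget_gb=16.0, latency_budget_ms=20.0).name)
```

In practice such a selector would be driven by measured profiles rather than fixed tables, but the sketch captures the intended trade-off: prefer lossless memory-reduction techniques when they fit the constraints, and fall back to lossy ones only when necessary.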
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: LLM Efficiency, NLP in resource-constrained settings
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches for low-compute settings-efficiency, Surveys
Languages Studied: English
Submission Number: 4725