Abstract: This paper presents MoE-Infinity, an efficient MoE inference system designed for personal machines with limited GPU memory capacity. The key idea behind MoE-Infinity is that personal machines, being typically single-user environments, run MoE-based LLMs with a batch size of one. In this setting, MoE models exhibit a high degree of activation sparsity: only a small number of experts are frequently reused when generating tokens during the decode phase. Leveraging this observation, we design a sparsity-aware expert cache that traces the sparse activation of experts during inference and carefully selects the traces that best represent the sparsity pattern. By analyzing these selected traces, MoE-Infinity guides the replacement and prefetching of the expert cache, providing 2.7–13.7× per-token latency improvements over numerous state-of-the-art systems, including vLLM, Ollama, DeepSpeed, and BrainStorm, across various MoE models (DeepSeek and Mixtral) when handling different LLM tasks.
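To illustrate the idea of a trace-guided expert cache, the following is a minimal Python sketch, not the paper's implementation: the class and method names (SparsityAwareExpertCache, record_activation, prefetch_candidates) are hypothetical, and the policy shown (evict the least-reused expert, prefetch the most-reused uncached experts) only approximates the trace selection and analysis described in the abstract.

```python
from collections import defaultdict, OrderedDict


class SparsityAwareExpertCache:
    """Sketch of an activation-trace-guided expert cache (hypothetical names).

    Keeps at most `capacity` experts resident on the GPU. Activation traces
    recorded during decoding estimate per-expert reuse frequency; eviction
    removes the least-reused cached expert, and prefetching ranks uncached
    experts by the same statistic.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cached = OrderedDict()           # expert_id -> GPU-resident weights
        self.reuse_counts = defaultdict(int)  # trace-derived activation counts

    def record_activation(self, expert_id: int) -> None:
        # Called each time the router activates an expert during decoding.
        self.reuse_counts[expert_id] += 1

    def get(self, expert_id: int, load_fn):
        # Return cached weights, loading (and possibly evicting) on a miss.
        if expert_id not in self.cached:
            if len(self.cached) >= self.capacity:
                victim = min(self.cached, key=lambda e: self.reuse_counts[e])
                del self.cached[victim]       # evict the least-reused expert
            self.cached[expert_id] = load_fn(expert_id)
        return self.cached[expert_id]

    def prefetch_candidates(self, k: int):
        # Uncached experts most likely to be reused next, per the traces.
        ranked = sorted(self.reuse_counts, key=self.reuse_counts.get, reverse=True)
        return [e for e in ranked if e not in self.cached][:k]
```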
Primary Area: General Machine Learning->Hardware and Software
Keywords: Machine Learning System, Mixture-of-Experts, MoE, Large Language Model, LLM
Submission Number: 316