Oracle-MoE: Locality-preserving Routing in the Oracle Space for Memory-constrained Large Language Model Inference

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Oracle-MoE optimizes expert activation consistency and reduces inference latency by leveraging semantic locality, enabling efficient deployment of large language models on memory-constrained edge devices.
Abstract: Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inherently memory-friendly architecture that requires only a few activated experts to reside in memory during inference, current MoE architectures fail to realize this advantage and yield intolerable inference latencies on memory-constrained devices. Our investigation identifies the root cause as the pronounced temporal inconsistency of expert activations across consecutive tokens, which generates excessively frequent expert swapping that dominates inference latency. To address this, we propose a novel MoE architecture, Oracle-MoE, to realize the true on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the *oracle space*, which preserves semantic locality across consecutive tokens, reduces expert activation variation, and thereby eliminates massive swapping demands. Theoretical analysis proves that Oracle-MoE yields routing decisions with better semantic locality and, therefore, higher expert activation consistency. Experiments on pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and downstream tasks demonstrate that, without compromising task performance, Oracle-MoE achieves state-of-the-art inference speed across varying memory budgets, revealing its substantial potential for industrial LLM deployment.
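To make the routing idea concrete, below is a minimal, hypothetical sketch of locality-preserving routing in a compact "oracle" space. The paper does not publish this interface: the learned projection standing in for the attention-suggested oracle space, the names (`OracleRouter`, `oracle_dim`, `num_experts`, `top_k`), and the nearest-centroid scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: route tokens by proximity in a low-dimensional "oracle"
# space so that semantically nearby (consecutive) tokens tend to select the
# same experts. All names and the linear projection are assumptions.
import torch
import torch.nn as nn


class OracleRouter(nn.Module):
    def __init__(self, hidden_dim: int, oracle_dim: int = 8,
                 num_experts: int = 16, top_k: int = 2):
        super().__init__()
        # Low-dimensional projection standing in for the attention-suggested
        # oracle space (assumption: a learned linear map).
        self.to_oracle = nn.Linear(hidden_dim, oracle_dim, bias=False)
        # One learnable centroid per expert in the oracle space.
        self.centroids = nn.Parameter(torch.randn(num_experts, oracle_dim))
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim)
        oracle = self.to_oracle(hidden)  # (batch, seq_len, oracle_dim)
        # Negative squared distance to each expert centroid: tokens close in
        # the oracle space get similar scores and hence similar expert sets,
        # which is the locality property that limits expert swapping.
        centroids = self.centroids.unsqueeze(0).expand(oracle.size(0), -1, -1)
        scores = -torch.cdist(oracle, centroids).pow(2)  # (batch, seq_len, num_experts)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        return expert_ids, weights  # selected experts and their mixing weights
```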
Lay Summary: Large language models (LLMs) are very powerful but hard to run on small devices such as smartphones, since these models usually need a lot of memory, which slows them down on memory-limited hardware. Mixture-of-Experts (MoE) activates only small parts of the model (called "experts") when needed, making it an inherently memory-friendly architecture under limited memory budgets. But current MoE methods often move experts in and out of memory repeatedly, and this frequent switching introduces intolerable inference latencies. We found that the root of this problem lies in how MoE decides which expert to use. Traditional MoE assigns experts based on rapidly changing token information; these quick changes lead to frequent and inefficient expert swapping and, ultimately, intolerable memory-communication pressure. To fix this, we introduce a new method called Oracle-MoE. Oracle-MoE chooses experts using a simpler grouping method we call the "oracle space," which groups tokens with similar semantics together and greatly reduces how often experts switch. Our experiments show that Oracle-MoE makes the model run much faster without losing accuracy. We also provide detailed explanations and implementation instructions, helping other researchers deploy powerful language models on devices with limited memory.
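As a back-of-the-envelope illustration (not from the paper) of why consistent expert activations cut swap traffic, the sketch below simulates a small in-memory expert cache with LRU eviction and counts how many expert loads two hypothetical routing traces would cause; all numbers and traces are made up for illustration.

```python
# Hypothetical simulation: count expert swap-ins under an LRU expert cache
# for an erratic router vs. a locality-preserving one. Traces are invented.
from collections import OrderedDict


def count_expert_swaps(trace, cache_size=4):
    """trace: list of per-token expert-id sets; returns number of expert loads."""
    cache, loads = OrderedDict(), 0
    for expert_set in trace:
        for e in expert_set:
            if e in cache:
                cache.move_to_end(e)           # cache hit: refresh LRU order
            else:
                loads += 1                     # cache miss: swap the expert in
                cache[e] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used expert
    return loads


# Invented traces over 8 experts, top-2 routing per token.
erratic    = [{0, 5}, {3, 7}, {1, 6}, {2, 4}, {0, 7}, {3, 5}]
consistent = [{0, 1}, {0, 1}, {0, 1}, {2, 3}, {2, 3}, {2, 3}]
print(count_expert_swaps(erratic), count_expert_swaps(consistent))  # 12 vs 4 loads
```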
Primary Area: Deep Learning->Large Language Models
Keywords: MoE; Edge Device; Inference Latency; Semantic Space
Submission Number: 3500