Oracle-MoE: Locality-preserving Routing in the Oracle Space for Memory-constrained Large Language Model Inference

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Oracle-MoE optimizes expert activation consistency and reduces inference latency by leveraging semantic locality, enabling efficient deployment of large language models on memory-constrained edge devices.
Abstract: Mixture-of-Experts (MoE) is widely adopted to deploy Large Language Models (LLMs) on edge devices with limited memory budgets. Although MoE is, in theory, an inherently memory-friendly architecture that requires only a few activated experts to reside in memory during inference, current MoE architectures fail to realize this advantage and yield intolerable inference latencies on memory-constrained devices. Our investigation identifies the root cause as the pronounced temporal inconsistency of expert activations across consecutive tokens, which generates excessively frequent expert swapping that dominates inference latency. To address this, we propose a novel MoE architecture, Oracle-MoE, to realize the true on-device potential of MoE-based LLMs. Oracle-MoE routes tokens in a highly compact space suggested by attention scores, termed the *oracle space*, which preserves semantic locality across consecutive tokens, reduces expert activation variation, and thereby eliminates massive swapping demands. Theoretical analysis proves that Oracle-MoE yields routing decisions with better semantic locality and, therefore, higher expert activation consistency. Experiments on pretrained GPT-2 architectures of different sizes (200M, 350M, 790M, and 2B) and downstream tasks demonstrate that, without compromising task performance, Oracle-MoE achieves state-of-the-art inference speed across varying memory budgets, revealing its substantial potential for industrial LLM deployment.
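To make the routing idea concrete, below is a minimal, hypothetical sketch of locality-preserving routing in a compact "oracle" space. The paper does not publish this interface: the learned projection standing in for the attention-suggested oracle space, the names (`OracleRouter`, `oracle_dim`, `num_experts`, `top_k`), and the nearest-centroid scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: route tokens by proximity in a low-dimensional "oracle"
# space so that semantically nearby (consecutive) tokens tend to select the
# same experts. All names and the linear projection are assumptions.
import torch
import torch.nn as nn


class OracleRouter(nn.Module):
    def __init__(self, hidden_dim: int, oracle_dim: int = 8,
                 num_experts: int = 16, top_k: int = 2):
        super().__init__()
        # Low-dimensional projection standing in for the attention-suggested
        # oracle space (assumption: a learned linear map).
        self.to_oracle = nn.Linear(hidden_dim, oracle_dim, bias=False)
        # One learnable centroid per expert in the oracle space.
        self.centroids = nn.Parameter(torch.randn(num_experts, oracle_dim))
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq_len, hidden_dim)
        oracle = self.to_oracle(hidden)  # (batch, seq_len, oracle_dim)
        # Negative squared distance to each expert centroid: tokens close in
        # the oracle space get similar scores and hence similar expert sets,
        # which is the locality property that limits expert swapping.
        centroids = self.centroids.unsqueeze(0).expand(oracle.size(0), -1, -1)
        scores = -torch.cdist(oracle, centroids).pow(2)  # (batch, seq_len, num_experts)
        weights, expert_ids = scores.topk(self.top_k, dim=-1)
        weights = torch.softmax(weights, dim=-1)
        return expert_ids, weights  # selected experts and their mixing weights
```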
Lay Summary: Large language models (LLMs) are very powerful but hard to run on small devices such as smartphones, since these models usually need a lot of memory, which slows them down on memory-limited hardware. Mixture-of-Experts (MoE) activates only small parts of the model (called "experts") when needed, making it an inherently memory-friendly architecture under limited memory budgets. But current MoE methods often move experts in and out of memory repeatedly, and this frequent switching introduces intolerable inference latencies. We found that the root of this problem lies in how MoE decides which expert to use. Traditional MoE assigns experts based on rapidly changing token information; these quick changes lead to frequent and inefficient expert swapping and, ultimately, intolerable memory-communication pressure. To fix this, we introduce a new method called Oracle-MoE. Oracle-MoE chooses experts using a simpler grouping method we call the "oracle space," which groups tokens with similar semantics together and greatly reduces how often experts switch. Our experiments show that Oracle-MoE makes the model run much faster without losing accuracy. We also provide detailed explanations and implementation instructions, helping other researchers deploy powerful language models on devices with limited memory.
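As a back-of-the-envelope illustration (not from the paper) of why consistent expert activations cut swap traffic, the sketch below simulates a small in-memory expert cache with LRU eviction and counts how many expert loads two hypothetical routing traces would cause; all numbers and traces are made up for illustration.

```python
# Hypothetical simulation: count expert swap-ins under an LRU expert cache
# for an erratic router vs. a locality-preserving one. Traces are invented.
from collections import OrderedDict


def count_expert_swaps(trace, cache_size=4):
    """trace: list of per-token expert-id sets; returns number of expert loads."""
    cache, loads = OrderedDict(), 0
    for expert_set in trace:
        for e in expert_set:
            if e in cache:
                cache.move_to_end(e)           # cache hit: refresh LRU order
            else:
                loads += 1                     # cache miss: swap the expert in
                cache[e] = True
                if len(cache) > cache_size:
                    cache.popitem(last=False)  # evict least recently used expert
    return loads


# Invented traces over 8 experts, top-2 routing per token.
erratic    = [{0, 5}, {3, 7}, {1, 6}, {2, 4}, {0, 7}, {3, 5}]
consistent = [{0, 1}, {0, 1}, {0, 1}, {2, 3}, {2, 3}, {2, 3}]
print(count_expert_swaps(erratic), count_expert_swaps(consistent))  # 12 vs 4 loads
```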
Primary Area: Deep Learning->Large Language Models
Keywords: MoE; Edge Device; Inference Latency; Semantic Space
Submission Number: 3500