Keywords: local routing consistency, MoE analysis, expert offloading
TL;DR: We introduce *local routing consistency* as a critical property for efficient expert offloading, conduct empirical analysis across various MoE LLMs, and provide practical insights for MoE architecture and cache system design.
Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference.
To deploy large MoE models effectively on memory-constrained devices, many systems introduce expert offloading, which caches a subset of experts in fast memory and leaves the others in slow memory to run on the CPU or be loaded on demand.
While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this **local routing consistency** varies across models and remains understudied.
In this paper, we propose two metrics to measure local routing consistency of MoE models:
(1) **Segment Routing best Performance (SRP)**, which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and
(2) **Segment Cache best Hit rate (SCH)**, which measures the hit rate of an expert cache that exploits a limited window of future routing information under a cache size limit.
We analyze 20 MoE LLMs with diverse sizes and architectures and use toy models to verify key factors related to local routing consistency.
We find a strong trade-off between local routing consistency and *local* load balance, while showing that *global* load balance can coexist with local routing consistency.
Meanwhile, design choices that shrink the expert combination space, such as shared experts, can lead to low local routing consistency.
We further reveal that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models strike a balance between cache effectiveness and efficiency at cache sizes roughly twice the number of active experts.
These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed.
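To make the two metrics more concrete, the following is a minimal Python sketch of how segment-level expert coverage (SRP-style) and a lookahead expert-cache hit rate (SCH-style) could be computed from a per-token routing trace. The function names, signatures, and exact formulations are illustrative assumptions for intuition only, not the paper's definitions of SRP and SCH.

```python
# Illustrative sketch (not the paper's exact definitions) of segment-level
# routing metrics, assuming a routing trace given as a list of per-token
# sets of activated expert IDs.
from collections import Counter
from typing import List, Set


def segment_routing_coverage(trace: List[Set[int]], segment_len: int, budget: int) -> float:
    """SRP-style score: for each token segment, pick the `budget` most frequently
    activated experts and measure the fraction of activations they cover."""
    covered, total = 0, 0
    for start in range(0, len(trace), segment_len):
        segment = trace[start:start + segment_len]
        counts = Counter(e for token in segment for e in token)
        best_group = {e for e, _ in counts.most_common(budget)}
        covered += sum(len(token & best_group) for token in segment)
        total += sum(len(token) for token in segment)
    return covered / total if total else 0.0


def segment_cache_hit_rate(trace: List[Set[int]], lookahead: int, cache_size: int) -> float:
    """SCH-style score: an expert cache of size `cache_size` that may look at the
    next `lookahead` tokens' routing when deciding which experts to keep."""
    cache: Set[int] = set()
    hits, total = 0, 0
    for t, token in enumerate(trace):
        hits += len(token & cache)
        total += len(token)
        # Refresh the cache with the experts most needed in the visible future window.
        window = trace[t + 1:t + 1 + lookahead]
        counts = Counter(e for tok in window for e in tok)
        cache = {e for e, _ in counts.most_common(cache_size)}
    return hits / total if total else 0.0


# Toy usage: 3 tokens, each activating 2 of 8 experts.
trace = [{0, 3}, {0, 5}, {3, 5}]
print(segment_routing_coverage(trace, segment_len=3, budget=2))
print(segment_cache_hit_rate(trace, lookahead=2, cache_size=2))
```

High values of both quantities indicate that a small, slowly changing set of experts serves many consecutive tokens, which is exactly the property an offloading cache can exploit.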
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14230