Elbow-based MoE Routing: A Training-Free Inference-Time Plugin for Expert Selection

Published: 01 Mar 2026, Last Modified: 05 Apr 2026 · TTU at ICLR 2026 (Main) Oral · CC BY 4.0
Abstract: Mixture-of-Experts (MoE) models enable model scaling while maintaining low inference-time compute by activating only a subset of experts per token. However, conventional routing relies on a fixed top-k selection, forcing the model to spend the same compute regardless of how many experts are relevant. We introduce elbow-based routing, a training-free inference-time modification that dynamically adjusts the number of experts on a per-token basis. Our method examines the sorted router probability distribution and identifies an elbow point that separates high- and low-probability experts. We find that most router distributions exhibit a clear elbow suitable for this strategy, and we show both theoretically and empirically that elbow-based routing preserves expert load balance. Experiments on a state-of-the-art MoE model demonstrate an average latency reduction of 5.3% while maintaining accuracy across six benchmarks.
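The abstract does not pin down the elbow criterion, so the following is a minimal PyTorch sketch of one plausible instantiation: the elbow is taken as the largest drop between consecutive sorted router probabilities. The function name `elbow_select_experts`, the `max_k` cap, and the gap-based criterion are illustrative assumptions, not the paper's published implementation.

```python
import torch

def elbow_select_experts(router_logits: torch.Tensor, max_k: int = 8):
    """Per-token expert selection via an elbow in the sorted router probabilities.

    Hypothetical sketch: the elbow is placed at the largest drop between
    consecutive sorted probabilities among the first `max_k` candidates.
    Requires max_k >= 2.
    """
    probs = torch.softmax(router_logits, dim=-1)            # (tokens, experts)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    # Drop between consecutive sorted probabilities; the elbow sits at the
    # largest drop, and everything before it counts as "high probability".
    gaps = sorted_probs[..., : max_k - 1] - sorted_probs[..., 1:max_k]
    k_per_token = gaps.argmax(dim=-1) + 1                   # (tokens,)
    # Boolean mask over experts: keep the top-k(t) experts of each token t.
    ranks = torch.arange(probs.size(-1), device=probs.device)
    keep_sorted = ranks.unsqueeze(0) < k_per_token.unsqueeze(-1)
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask.scatter_(-1, sorted_idx, keep_sorted)
    # Renormalize the surviving probabilities into routing weights.
    weights = torch.where(mask, probs, torch.zeros_like(probs))
    weights = weights / weights.sum(dim=-1, keepdim=True)
    return mask, weights, k_per_token


# Example: 4 tokens routed over 16 experts.
mask, weights, k = elbow_select_experts(torch.randn(4, 16))
print(k)  # experts activated per token, e.g. tensor([2, 1, 3, 1])
```

A gap-based criterion like this adds only a sort and a subtraction on top of the standard router softmax, which is consistent with the training-free, inference-time framing of the abstract.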
Submission Number: 61