Keywords: Mixture-of-Experts, Load Balancing, Inference Optimization, Expert Routing, Large Language Models
TL;DR: LASER adapts to gate score distributions at inference time, reducing MoE load imbalance without sacrificing accuracy; it is plug-and-play and requires no retraining.
Abstract: Mixture-of-Experts (MoE) models can scale parameter capacity by routing each token to a subset of experts through a learned gate function. While conditional routing reduces training costs, it shifts the burden to inference memory: expert parameters and activations consume memory, limiting the number of experts per device. As tokens are routed, some experts become overloaded while others are underutilized. Because experts are mapped to GPUs, this imbalance translates directly into degraded system performance in terms of latency, throughput, and cost. We present LASER, a plug-and-play, inference-time routing algorithm that balances load while preserving accuracy. LASER adapts to the shape of the gate's score distribution. When scores provide a clear preference, it routes to the strongest experts; when scores are more uniform, it broadens the set of viable experts and routes to the least-loaded among them. Because LASER relies only on gate scores from a trained model, it integrates directly into existing MoE inference pipelines without retraining or finetuning. We evaluate LASER on Mixtral-8×7B and DeepSeek-MoE-16b-chat across four datasets (ARC-Easy, ARC-Challenge, MMLU, and GSM8K). LASER improves load balancing, which translates into lower latency and higher throughput, while keeping accuracy changes negligible.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 14359
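
The abstract describes LASER only at a high level: follow the gate when its score distribution shows a clear preference, and fall back to load-aware routing over a broadened candidate set when scores are near-uniform. The sketch below illustrates that idea under stated assumptions; the entropy test, the threshold, the pool size, and the names (`laser_route_sketch`, `expert_loads`) are placeholders chosen for illustration, not the paper's actual rule.

```python
import numpy as np

def laser_route_sketch(gate_scores, expert_loads, k=2, pool_size=4,
                       entropy_threshold=0.5):
    """Pick the top-k experts when the gate is confident; otherwise pick the
    k least-loaded experts from a broadened candidate pool.

    Illustrative only: the confidence test and all constants are assumptions,
    not LASER's published routing rule.
    """
    # Softmax over gate logits (numerically stable).
    probs = np.exp(gate_scores - gate_scores.max())
    probs /= probs.sum()

    # Normalized entropy in [0, 1]: ~0 means a peaked (confident) gate,
    # ~1 means a near-uniform gate.
    entropy = -(probs * np.log(probs + 1e-12)).sum() / np.log(len(probs))

    ranked = np.argsort(probs)[::-1]  # experts sorted by gate score, descending
    if entropy < entropy_threshold:
        # Clear preference: follow the gate's top-k choice.
        chosen = ranked[:k]
    else:
        # Near-uniform scores: widen the candidate set, then route to the
        # least-loaded experts among those candidates.
        candidates = ranked[:pool_size]
        chosen = candidates[np.argsort(expert_loads[candidates])[:k]]
    return chosen, probs[chosen]

# Toy usage: 8 experts, top-2 routing, random per-expert loads.
rng = np.random.default_rng(0)
gate_logits = rng.normal(size=8)
loads = rng.integers(0, 100, size=8).astype(float)
experts, weights = laser_route_sketch(gate_logits, loads)
print("chosen experts:", experts, "gate weights:", weights)
```

In a real serving stack, `expert_loads` would presumably be derived from live signals such as per-GPU queue depth or tokens already assigned to each expert in the current batch; here it is simply an array supplied by the caller.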