One Must Imagine Experts Happy: Rebalancing Neural Routers via Constrained Optimization

Published: 05 Mar 2025, Last Modified: 22 Apr 2025
Venue: SLLM
License: CC BY 4.0
Track: long paper (up to 4 pages)
Keywords: mixture of experts
TL;DR: We introduce a dual ascent–based constrained optimization framework that rebalances MoE routing via bias updates and adaptive sparsemax gating, achieving uniform expert loads without interfering gradients.
Abstract: Mixture-of-Experts models promise scalable capacity by routing tokens to a sparse set of expert networks, but imbalanced routing (i.e., routing collapse) can degrade performance and hinder distributed expert parallelism. Conventional remedies add an auxiliary load-balancing loss, which introduces gradients that conflict with the primary training objective. Recent bias-based routing strategies avoid auxiliary losses by dynamically adjusting per-expert biases updated using the sign of each expert's deviation from the mean load; this requires careful tuning, remains susceptible to load fluctuations and suboptimal load utilization, and can oscillate ("gating thrash") if mistuned. We propose Dual Unified Ascent for Load-balancing (DUAL), a technique that recasts router load balancing as a constrained optimization problem. DUAL learns per-expert bias updates derived from Lagrange dual variables, applying bias increments proportional to each expert's load error with a damping term that prevents overshoot, and pairs them with a differentiable router that uses sparsemax over the gating logits. Experimental results on large language models demonstrate that DUAL attains more uniform expert utilization without sacrificing quality, consistently reduces router imbalance, and slightly outperforms state-of-the-art Mixture-of-Experts techniques.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 89
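
The routing mechanism described in the abstract can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the sparsemax routine follows the standard construction of Martins & Astudillo (2016), while the DualBiasRouter class, its step_size and damping parameters, and the damped (momentum-style) update rule are hypothetical names and one plausible reading of "bias increments proportional to each expert's load error with a damping term."

```python
import torch


def sparsemax(logits: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dimension (Martins & Astudillo, 2016):
    Euclidean projection of the logits onto the simplex, giving exactly
    sparse gates (unlike softmax, some entries are exactly zero)."""
    z, _ = torch.sort(logits, dim=-1, descending=True)
    k = torch.arange(1, logits.size(-1) + 1,
                     device=logits.device, dtype=logits.dtype)
    z_cumsum = z.cumsum(dim=-1)
    support = 1 + k * z > z_cumsum                 # sorted entries in the support
    k_z = support.sum(dim=-1, keepdim=True)        # support size per row
    tau = (z_cumsum.gather(-1, k_z - 1) - 1) / k_z.to(logits.dtype)
    return torch.clamp(logits - tau, min=0.0)


class DualBiasRouter(torch.nn.Module):
    """Hypothetical sketch of a DUAL-style router. The per-expert bias acts
    as a Lagrange dual variable for a uniform-load constraint and is updated
    by dual ascent outside of autograd."""

    def __init__(self, d_model: int, n_experts: int,
                 step_size: float = 1e-2, damping: float = 0.9):
        super().__init__()
        self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)
        # Dual variables: not nn.Parameters, so no gradient flows through them.
        self.register_buffer("bias", torch.zeros(n_experts))
        self.register_buffer("velocity", torch.zeros(n_experts))
        self.step_size = step_size
        self.damping = damping

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -> gates: (tokens, n_experts), exactly sparse.
        return sparsemax(self.w_gate(x) + self.bias)

    @torch.no_grad()
    def dual_update(self, gates: torch.Tensor) -> None:
        # Dual ascent on the load-balance constraint: raise the bias of
        # underloaded experts, lower it for overloaded ones. The damped
        # update is an assumption made to illustrate overshoot prevention.
        load = gates.sum(dim=0) / gates.sum()      # fraction of load per expert
        target = 1.0 / gates.size(-1)              # uniform-load target
        error = target - load                      # per-expert load error
        self.velocity.mul_(self.damping).add_(self.step_size * error)
        self.bias.add_(self.velocity)
```

In a training loop one would call router.dual_update(gates) after each forward pass. Because the bias lives outside autograd, the rebalancing update injects no gradients into the primary loss, which is the interference-free property the abstract emphasizes over auxiliary-loss approaches.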