Load Balancing Mixture of Experts with Similarity Preserving Routers

ICLR 2026 Conference Submission13113 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: mixture of experts, experts, routing, moe, large language models, llm, load balancing, language models
TL;DR: We replace traditional balancing losses that enforce uniform sequence-wise usage with a new loss, improving performance.
Abstract: Sparse Mixture of Experts (MoE) models offer a scalable and efficient architecture for training large neural networks by activating only a subset of parameters (“experts”) for each input. A learned router computes a distribution over these experts and assigns each input token to a small subset of them. However, without auxiliary balancing mechanisms, routers often converge to using only a few experts, severely limiting model capacity and degrading performance. Most current load balancing mechanisms encourage a roughly uniform distribution of tokens across experts. During training, this can lead to inconsistent routing behavior, causing the model to spend its capacity learning redundant knowledge. We address this by introducing a novel load balancing loss that preserves token-wise relational structure, encouraging consistent expert choices for similar inputs during training. Our experiments show that replacing a popular load balancing loss with ours yields 35% faster convergence and lower redundancy, while removing balancing hyper-parameters entirely.
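The abstract does not specify the exact form of the similarity-preserving loss. As a rough, hedged illustration only, the sketch below shows one way such an objective could look, assuming it penalizes mismatch between the pairwise cosine similarities of token representations and the pairwise similarities of their router assignment distributions. All names here (`similarity_preserving_loss`, `hidden`, `router_logits`) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def similarity_preserving_loss(hidden: torch.Tensor, router_logits: torch.Tensor) -> torch.Tensor:
    """Hypothetical sketch: encourage tokens with similar hidden states
    to receive similar expert assignment distributions.

    hidden:        (num_tokens, d_model)    token representations fed to the router
    router_logits: (num_tokens, num_experts) unnormalized router scores
    """
    # Pairwise cosine similarity of token representations (num_tokens x num_tokens).
    h = F.normalize(hidden, dim=-1)
    token_sim = h @ h.T

    # Pairwise similarity of router distributions over experts.
    probs = router_logits.softmax(dim=-1)
    p = F.normalize(probs, dim=-1)
    route_sim = p @ p.T

    # Penalize disagreement between the two similarity structures.
    return F.mse_loss(route_sim, token_sim)


# Example usage with random tensors (shapes chosen only for illustration).
tokens, d_model, num_experts = 16, 64, 8
hidden = torch.randn(tokens, d_model)
router_logits = torch.randn(tokens, num_experts)
aux_loss = similarity_preserving_loss(hidden, router_logits)
```

Unlike a uniform-usage balancing term, an objective of this shape has no target distribution and hence no balancing coefficient to tune, which is consistent with the abstract's claim of removing balancing hyper-parameters; the actual loss used by the authors may differ.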
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13113