Keywords: integer programming, load balance, mixture-of-experts model
TL;DR: We propose a new expert load balancing algorithm for MoE pre-training based on binary integer programming; it achieves the lowest perplexities among the compared methods while saving at least 13% of pre-training time.
Abstract: For pre-training of MoE (Mixture-of-Experts) models, one of the main issues is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing approaches include the Loss-Controlled method and the Loss-Free method; with both, the degree of imbalance during the first several training steps remains high and decreases only slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q on each MoE layer that changes the top-K ordering of the routing scores s by solving a binary integer program at very small time cost. We implement the algorithm on two MoE language models: one with 16 experts (0.3B parameters) and one with 64 experts (1.1B parameters). Experimental results show that on both models, compared with the Loss-Controlled and Loss-Free methods, our algorithm trains models with the lowest perplexities, while saving at least 13% of pre-training time relative to the Loss-Controlled method. To the best of our knowledge, this is the first routing algorithm that maintains load balance on every expert in every MoE layer from the first step to the last step of the entire pre-training process, while the trained MoE models also perform well.
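The abstract describes the mechanism only at a high level. Below is a minimal, hypothetical sketch (not the paper's formulation) of the general idea of balancing expert loads with a binary integer program: each token is assigned to K experts by maximizing the total router score subject to a per-expert capacity of ceil(T*K/E). The function name `bip_balanced_topk`, the capacity choice, and the use of SciPy's `milp` solver are illustrative assumptions; the paper's per-layer vector q and its exact BIP are not reproduced here.

```python
# Illustrative sketch only: balanced top-K expert assignment posed as a
# binary integer program and solved with SciPy's MILP interface.
# Variable x[t, e] = 1 iff token t is routed to expert e.
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def bip_balanced_topk(scores: np.ndarray, k: int) -> np.ndarray:
    """scores: (T, E) router scores s; returns a binary (T, E) assignment.

    Maximizes total routed score subject to:
      * each token is routed to exactly k experts,
      * each expert receives at most ceil(T * k / E) tokens
        (balanced loads up to rounding).
    """
    T, E = scores.shape
    capacity = int(np.ceil(T * k / E))
    n = T * E

    # Objective: milp minimizes, so negate the scores.
    c = -scores.reshape(n)

    # Each token picks exactly k experts.
    A_tok = np.zeros((T, n))
    for t in range(T):
        A_tok[t, t * E:(t + 1) * E] = 1.0
    token_con = LinearConstraint(A_tok, lb=k, ub=k)

    # Each expert's load is capped at `capacity`.
    A_exp = np.zeros((E, n))
    for e in range(E):
        A_exp[e, e::E] = 1.0
    expert_con = LinearConstraint(A_exp, lb=0, ub=capacity)

    res = milp(
        c=c,
        constraints=[token_con, expert_con],
        integrality=np.ones(n),      # all variables are binary
        bounds=Bounds(lb=0, ub=1),
    )
    return res.x.reshape(T, E).round().astype(np.int64)

# Usage: 8 tokens, 4 experts, top-2 routing -> every expert gets exactly 4 tokens.
rng = np.random.default_rng(0)
assignment = bip_balanced_topk(rng.standard_normal((8, 4)), k=2)
print(assignment.sum(axis=0))  # per-expert load, e.g. [4 4 4 4]
```

In a real MoE layer this assignment would replace (or bias) the unconstrained top-K over s; the dense constraint matrices above are for readability and do not reflect the very small per-step solve cost claimed in the paper.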
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 7723