CompeteSMoE - Statistically Guaranteed Mixture of Experts Training via Competition

ICLR 2026 Conference Submission 15638 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mixture of Experts, Large Language Models
TL;DR: CompeteSMoE introduces a competition mechanism for efficient sparse mixture-of-experts training: tokens are routed to the experts with the highest neural response, yielding better sample efficiency and strong performance on visual and language tasks.
Abstract: Sparse mixture of experts (SMoE) offers an appealing way to scale up model complexity beyond merely increasing the network's depth or width. However, we argue that effective SMoE training remains challenging because of a suboptimal routing process in which the experts that perform the computation do not directly contribute to the routing decision. In this work, we propose competition, a novel mechanism that routes tokens to the experts with the highest neural response. Theoretically, we show that the competition mechanism enjoys better sample efficiency than traditional softmax routing. Furthermore, we develop CompeteSMoE, a simple yet effective algorithm for training large language models that deploys a router to learn the competition policy, achieving strong performance at low training overhead. Our extensive empirical evaluations on both visual instruction tuning and language pre-training tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies. We will publish the implementation upon acceptance.
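
The following is a minimal PyTorch sketch of the competition idea as described in the abstract, not the authors' implementation. It assumes that "neural response" means the norm of each expert's output, uses top-2 routing, and trains the lightweight router with a simple cross-entropy loss toward the competition winner; all class and variable names are illustrative. Presumably this full-expert competition step would be invoked only on a schedule during training so that the overhead stays low, with the learned router handling routing the rest of the time.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CompetitionRoutingSketch(nn.Module):
    """Route each token to the top-k experts with the highest neural response,
    and train a lightweight router to imitate that competition policy."""

    def __init__(self, d_model, n_experts, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])
        self.router = nn.Linear(d_model, n_experts)  # learns the competition policy
        self.top_k = top_k

    def forward(self, x):
        # x: (num_tokens, d_model). Competition phase: evaluate every expert and
        # score it by its neural response (assumed here to be the output norm).
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (T, E, d)
        response = expert_out.norm(dim=-1)                             # (T, E)

        # Winners of the competition are the experts with the largest response.
        gate_val, winners = response.topk(self.top_k, dim=-1)          # (T, k)
        gates = F.softmax(gate_val, dim=-1)

        # Combine the winning experts' outputs, weighted by their gates.
        idx = winners.unsqueeze(-1).expand(-1, -1, x.size(-1))         # (T, k, d)
        y = (gates.unsqueeze(-1) * expert_out.gather(1, idx)).sum(dim=1)

        # Distillation-style loss pushing the cheap router toward the
        # competition policy (illustrative choice, not the paper's loss).
        router_loss = F.cross_entropy(self.router(x), winners[:, 0])
        return y, router_loss

In this sketch, inference would use only the router's top-k choices, as in a standard SMoE layer, so the cost of evaluating all experts is paid only during the training steps where the competition is run.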
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15638