Keywords: Expert Pruning, MoE, LLM, Expert Load Balancing
Abstract: While Mixture-of-Experts (MoE) Large Language Models (LLMs) achieve higher accuracy with fewer active parameters, their pre-training remains challenging due to enormous parameter counts and the low training efficiency caused by imbalanced expert routing. Unlike previous expert pruning methods that focus on the post-training phase, this paper proposes an efficient Expert Pruning Algorithm (EPA) for the pre-training of MoE LLMs. The algorithm improves training efficiency while preserving model accuracy by pruning underutilized experts and rearranging the remaining experts across expert parallel groups according to the token distribution. Extensive experimental results demonstrate that EPA significantly reduces model size and improves training efficiency while keeping accuracy nearly unchanged. Specifically, a 1010B-parameter MoE LLM trained from scratch with EPA achieves substantial gains in training efficiency and delivers strong performance on tasks across diverse domains. The code and the 1010B model will be made publicly available.
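The abstract describes two operations: pruning underutilized experts and rearranging the survivors across expert parallel groups based on token distribution. The sketch below is a minimal, hypothetical illustration of that idea, not the authors' implementation: the function names, the `keep_ratio` threshold, and the greedy load-balancing heuristic are all assumptions made for demonstration.

```python
import torch


def prune_underutilized_experts(router_token_counts, keep_ratio=0.5):
    """Return indices of experts to keep in one MoE layer, dropping those
    that received the fewest routed tokens during a profiling window.

    router_token_counts: 1-D tensor of tokens routed to each expert.
    keep_ratio: fraction of experts retained (assumed hyperparameter).
    """
    num_experts = router_token_counts.numel()
    num_keep = max(1, int(num_experts * keep_ratio))
    # Keep the most-utilized experts; the rest are pruned.
    keep_idx = torch.topk(router_token_counts, num_keep).indices
    return torch.sort(keep_idx).values


def rearrange_experts_across_ep_groups(kept_token_counts, num_ep_groups):
    """Greedily assign kept experts to expert-parallel groups so that the
    total routed-token load per group is roughly balanced (a simple
    longest-processing-time heuristic, not necessarily the paper's scheme).
    Returned indices refer to positions within the kept-expert set.
    """
    order = torch.argsort(kept_token_counts, descending=True)
    group_load = [0.0] * num_ep_groups
    assignment = [[] for _ in range(num_ep_groups)]
    for expert in order.tolist():
        g = min(range(num_ep_groups), key=lambda i: group_load[i])
        assignment[g].append(expert)
        group_load[g] += kept_token_counts[expert].item()
    return assignment


if __name__ == "__main__":
    # Example: 8 experts with a skewed token distribution from profiling.
    counts = torch.tensor([9500, 120, 8700, 40, 7600, 300, 6900, 15],
                          dtype=torch.float)
    kept = prune_underutilized_experts(counts, keep_ratio=0.5)
    print("kept experts:", kept.tolist())  # [0, 2, 4, 6]
    print("EP-group assignment:",
          rearrange_experts_across_ep_groups(counts[kept], num_ep_groups=2))
```

In this toy example, half of the experts are dropped because they attract very few tokens, and the retained experts are spread over two expert parallel groups so that neither group carries a disproportionate token load.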
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2026/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 25312