LibMoE: A Library for Comprehensive Research on Mixture of Experts in Large Language Models

01 Feb 2026 (modified: 24 Apr 2026) · Under review for TMLR · CC BY 4.0
Abstract: Mixture of experts (MoE) architectures have become a cornerstone of model scaling and are a key component in most large language models, such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, which put large-scale studies out of reach for most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, and through our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations.
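As an illustration of the routing-entropy probe described in the abstract, the following is a minimal sketch, not LibMoE's actual API: given raw router logits for a batch of tokens, it computes the mean per-token routing entropy (how confident individual routing decisions are) and the entropy of the aggregate top-K expert-usage distribution (how diverse or balanced expert utilization is). All function names, tensor shapes, and the top-K value here are illustrative assumptions.

```python
# Minimal sketch of a routing-entropy probe for a top-K MoE router.
# Assumed input: router_logits of shape [num_tokens, num_experts].
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def routing_entropy_stats(router_logits: np.ndarray, top_k: int = 2):
    probs = softmax(router_logits, axis=-1)                       # [T, E]
    # Per-token entropy of the routing distribution (low = confident routing).
    per_token_entropy = -(probs * np.log(probs + 1e-9)).sum(-1)   # [T]
    # Aggregate expert usage: how often each expert lands in the top-k.
    topk = np.argsort(-probs, axis=-1)[:, :top_k]                 # [T, K]
    usage = np.bincount(topk.ravel(), minlength=probs.shape[1]).astype(float)
    usage /= usage.sum()
    # Entropy of the usage distribution (high = balanced, diverse expert use).
    usage_entropy = -(usage * np.log(usage + 1e-9)).sum()
    return per_token_entropy.mean(), usage_entropy, usage

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(1024, 8))          # 1024 tokens, 8 experts (toy example)
    mean_tok_ent, usage_ent, usage = routing_entropy_stats(logits, top_k=2)
    print(f"mean per-token routing entropy: {mean_tok_ent:.3f}")
    print(f"expert-usage entropy:           {usage_ent:.3f}")
    print("expert usage:", np.round(usage, 3))
```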
Submission Type: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=k7n9Gj8nz4
Changes Since Last Submission: We thank the reviewers for their constructive feedback. In the revised manuscript, we made the following changes:
* **Editorial corrections — throughout the manuscript.** We corrected punctuation issues and removed duplicate citation entries.
* **Overview of benchmarked SMoE variants — Appendix A.** We added an overview of the seven benchmarked SMoE algorithms, summarizing their routing rules, structural differences, and design motivations.
* **Sensitivity to the number of active experts — Appendix B, Figure 10.** We added a sensitivity analysis on the active-expert budget. In the VLM setting, we evaluate $K \in \{1, 2, 3\}$ with $N = 6$. In the language-model pretraining setting, we evaluate $K \in \{2, 8, 32\}$ with $N = 66$.
* **Multi-seed results — Appendix C, Tables 3 and 4.** We added three-seed mean±std results for both the VLM benchmark and the language-modeling benchmark.
* **Hybrid dense–sparse architecture experiment — Appendix D, Table 5.** We added a controlled comparison between Hybrid SharedE-V3 and fully sparse SharedE-V3 in the 5.67B VLM setting on LLaVA-665K.
* **Large-scale Qwen3-VL-30B-A3B analysis — Section 5 opening paragraph and Section 5(c)–5(f).** We added Qwen3-VL-30B-A3B as a representative large-scale SMoE reference model in the routing-behavior analyses.
* **Revised Summary of Key Results — Section 6.** We rewrote the summary section to organize the main findings into clearer, design-oriented principles.
* **Training time and resource usage — Appendix J.1, Table 14.** We added training time and GPU resource allocation across all experimental settings.
* **Peak GPU memory and inference latency — Appendix J.2, Table 15.** We added peak GPU memory usage and per-sample inference latency for all seven SMoE methods.
Assigned Action Editor: ~Yen-Chang_Hsu1
Submission Number: 7282