Keywords: Mixture of Experts, Large Language Models, Load Balancing, Expert Specialization
Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing tokens to a small subset of experts.
Expert grouping improves hardware utilization but introduces a subtle weight-space symmetry: experts within a group are exposed to similar token distributions, driving them toward permutation-equivalent (redundant) solutions.
We propose Hi-MoE, a hierarchical framework that explicitly breaks this symmetry through two complementary objectives: (i) an inter-group balancing term that enforces fair traffic across device-aligned expert groups, and (ii) an intra-group diversity term that promotes complementary expert behaviors and prevents within-group collapse.
We show that these objectives are interpretable as Lagrange multipliers of a principled balance–specialization constrained optimization problem.
Experiments on Swin-MoE (Tiny ImageNet) and OLMoE-7B (58B tokens) show consistent improvements: Hi-MoE-7B achieves a 5.6% perplexity reduction and 40% better expert balance over OLMoE-7B. Our code is available at: https://github.com/brain-lab-research/Hi-MoE.
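
To make the two objectives in the abstract concrete, the following is a minimal NumPy sketch of what an inter-group balancing term and an intra-group diversity term could look like. It is not the authors' formulation: the function names, the Switch-style form of the balance loss, and the cosine-similarity form of the diversity penalty are all assumptions chosen for illustration, and the linked repository should be consulted for the actual definitions. The sketch assumes experts are laid out contiguously by group, with at least two experts per group.

```python
import numpy as np

def top_k_mask(probs, k):
    """Binary mask marking the k highest-probability experts for each token."""
    idx = np.argpartition(-probs, k - 1, axis=-1)[:, :k]
    mask = np.zeros_like(probs)
    np.put_along_axis(mask, idx, 1.0, axis=-1)
    return mask

def inter_group_balance(probs, mask, num_groups):
    """Hypothetical group-level balance loss (Switch-style, assumed form).

    probs, mask: arrays of shape (tokens, experts).
    Equals 1.0 when both traffic and router mass are uniform across groups.
    """
    n_tokens, n_experts = probs.shape
    per_group = n_experts // num_groups
    k = mask.sum() / n_tokens  # experts selected per token
    # Fraction of routed traffic landing in each group.
    frac = mask.reshape(n_tokens, num_groups, per_group).sum(-1).mean(0) / k
    # Mean router probability mass assigned to each group.
    mass = probs.reshape(n_tokens, num_groups, per_group).sum(-1).mean(0)
    return num_groups * float(np.sum(frac * mass))

def intra_group_similarity(probs, num_groups):
    """Hypothetical diversity penalty (assumed form): mean pairwise cosine
    similarity between the routing profiles of experts in the same group.
    Lower values mean experts in a group attract different tokens."""
    n_tokens, n_experts = probs.shape
    per_group = n_experts // num_groups
    cols = probs.reshape(n_tokens, num_groups, per_group)
    cols = cols / (np.linalg.norm(cols, axis=0, keepdims=True) + 1e-9)
    # Per-group Gram matrix of expert routing profiles: (groups, e, e).
    gram = np.einsum("tge,tgf->gef", cols, cols)
    off_diag = gram.sum(axis=(1, 2)) - np.trace(gram, axis1=1, axis2=2)
    return float(off_diag.mean() / (per_group * (per_group - 1)))
```

In a training loop, terms of this kind would typically be added to the task loss with two weighting coefficients (hypothetical here), which is where the Lagrange-multiplier reading mentioned in the abstract would enter.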
Submission Number: 47