Hierarchical Mixture-of-Experts with Two-Stage Optimization

Published: 24 May 2026, Last Modified: 02 Jun 2026 | ICML 2026 Workshop WSS Poster | CC BY 4.0
Keywords: Mixture of Experts, Large Language Models, Load balancing, Expert Specialization
Abstract: Sparse Mixture-of-Experts (MoE) models scale capacity by routing tokens to a small subset of experts. Expert grouping improves hardware utilization but introduces a subtle weight-space symmetry: experts within a group are exposed to similar token distributions, driving them toward permutation-equivalent (redundant) solutions. We propose Hi-MoE, a hierarchical framework that explicitly breaks this symmetry through two complementary objectives: (i) an inter-group balancing term that enforces fair traffic across device-aligned expert groups, and (ii) an intra-group diversity term that promotes complementary expert behaviors and prevents within-group collapse. We show that these objectives are interpretable as Lagrange multipliers of a principled balance-specialization constrained optimization problem. Experiments on Swin-MoE (Tiny ImageNet) and OLMoE-7B (58B tokens) show consistent improvements: Hi-MoE-7B achieves a 5.6% perplexity reduction and 40% better expert balance over OLMoE-7B. Our code is available at: https://github.com/brain-lab-research/Hi-MoE.
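Illustrative sketch: the abstract names the two auxiliary objectives but does not give their formulas, so the following is only one plausible way such terms could look, not the authors' implementation. It assumes a Switch-style balance term computed over expert groups and a mean pairwise cosine-similarity penalty within each group. All function names, the contiguous expert-to-group layout, and the weights `lam_balance` and `lam_div` are hypothetical.

```python
import numpy as np

def inter_group_balance_loss(router_probs, top1, num_groups):
    """Assumed form: Switch-style balance term aggregated over expert groups.

    router_probs: (num_tokens, num_experts) routing probabilities.
    top1: (num_tokens,) integer index of each token's top-1 expert.
    Experts are assumed to be laid out contiguously by group.
    """
    num_tokens, num_experts = router_probs.shape
    per_group = num_experts // num_groups
    # Mean router probability mass assigned to each group.
    group_prob = router_probs.reshape(num_tokens, num_groups, per_group).sum(-1).mean(0)
    # Fraction of tokens whose top-1 expert lies in each group.
    group_frac = np.bincount(top1 // per_group, minlength=num_groups) / num_tokens
    # Equals 1.0 when both quantities are uniform; grows as traffic concentrates.
    return num_groups * float(np.dot(group_frac, group_prob))

def intra_group_diversity_loss(expert_outputs, num_groups):
    """Assumed form: mean pairwise cosine similarity between experts of one group.

    expert_outputs: (num_experts, dim), e.g. each expert's mean output on a
    shared batch. Requires at least two experts per group.
    """
    num_experts, dim = expert_outputs.shape
    per_group = num_experts // num_groups
    norms = np.linalg.norm(expert_outputs, axis=1, keepdims=True) + 1e-8
    grouped = (expert_outputs / norms).reshape(num_groups, per_group, dim)
    # Cosine similarity matrix inside every group: (groups, per_group, per_group).
    sim = grouped @ grouped.transpose(0, 2, 1)
    # Drop the diagonal (self-similarity) and average over ordered pairs.
    off_diag = sim.sum() - np.trace(sim, axis1=1, axis2=2).sum()
    return float(off_diag / (num_groups * per_group * (per_group - 1)))

def total_loss(task_loss, balance, diversity, lam_balance=0.01, lam_div=0.01):
    """Lagrangian-style combination; the weights play the role of multipliers."""
    return task_loss + lam_balance * balance + lam_div * diversity
```

In this sketch, redundant experts within a group push the diversity term toward 1, while mutually orthogonal experts drive it to 0; routing every token to a single group raises the balance term from 1 to the number of groups.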
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 47