Keywords: Depth scaling; MoE
Abstract: Mixture-of-Experts (MoE) Transformers rely on a router to distribute tokens across experts, and a severely unbalanced router wastes model capacity. We study load balance at \emph{random initialization}, before any auxiliary loss has had an effect, and show that it is fundamentally a question about the geometry of the hidden states entering the router. When hidden states are diverse across tokens, a random linear router produces diverse logits and top-$k$ selection naturally spreads tokens across experts; when hidden states collapse, the router collapses with them. We connect this observation to representation collapse in deep pre-norm Transformers and argue that $1/\sqrt{L}$ depth scaling, beyond its known benefits for training stability, also improves routing balance at initialization. We additionally observe that Muon better preserves this balance during training by producing orthogonalized updates to the router and expert matrices. We verify these claims empirically.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 123
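
The abstract's central claim, that a random linear router balances load when hidden states are diverse and collapses when they collapse, can be illustrated with a small NumPy sketch. This is not the paper's experimental setup: the Gaussian hidden states, the helper names `make_hidden` and `expert_load`, the `collapse` interpolation parameter, and all sizes are assumptions chosen only for illustration.

```python
import numpy as np

def expert_load(hidden, router_weight, k):
    """Fraction of top-k routing slots assigned to each expert."""
    logits = hidden @ router_weight  # shape: (tokens, experts)
    # Indices of the k largest logits per token (order within the top-k is irrelevant).
    topk = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    counts = np.bincount(topk.ravel(), minlength=router_weight.shape[1])
    return counts / counts.sum()

def make_hidden(rng, n_tokens, d_model, collapse):
    """Hypothetical helper: collapse=0 gives i.i.d. Gaussian tokens,
    collapse=1 gives every token the same shared vector."""
    shared = rng.standard_normal((1, d_model))
    independent = rng.standard_normal((n_tokens, d_model))
    return collapse * shared + (1.0 - collapse) * independent

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, k = 4096, 256, 16, 2

# Random linear router at initialization (assumed Gaussian, 1/sqrt(d_model) scale).
router_weight = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

diverse = expert_load(make_hidden(rng, n_tokens, d_model, 0.0), router_weight, k)
collapsed = expert_load(make_hidden(rng, n_tokens, d_model, 1.0), router_weight, k)

# A perfectly balanced router would give each expert 1 / n_experts of the load.
print("uniform share:        ", 1.0 / n_experts)
print("max load (diverse):   ", diverse.max())
print("max load (collapsed): ", collapsed.max())
```

In the fully collapsed case every token produces the same logits, so the same k experts receive all of the load. Intermediate values of `collapse` can be used to explore how balance degrades between the two extremes.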