Why Routers Freeze: Infinite Width Learning Dynamics for Mixture of Experts

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: moe, scaling, tensor programs
TL;DR: We derive the infinite-width limit for MoEs with a fixed number of experts and provide insight into router behaviour.
Abstract: Mixture-of-Experts (MoE) models scale efficiently through sparse expert activation, but their training dynamics remain poorly understood. We study MoEs in the infinite-width limit with a fixed number of experts, a regime relevant to width-based scaling for hyperparameter transfer. Using Tensor Programs, we derive the training dynamics of soft and Top-$K$ MoEs under SGD and Adam. We show that under the Standard Parameterisation, router logits diverge after one step of feature learning, causing softmax or sigmoid gates to saturate and router gradients to vanish. To address this, we derive a $\mu$P-MoE scaling that restores stability. Under this scaling, however, soft routing produces symmetric router dynamics: experts remain identically distributed and fail to specialise. For softmax routers, this symmetry also nullifies router gradients. We then show that Top-$K$ routing has a qualitatively different effect: even when logits converge to a symmetric limit, finite-width fluctuations determine the selected experts, making Top-$K$ an implicit symmetry-breaking mechanism. Experiments validate the predicted scaling laws and demonstrate hyperparameter transfer under $\mu$P-MoE.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 174
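
The saturation mechanism named in the abstract (large router logits push a softmax gate towards a one-hot output, so gradients through the gate vanish) can be illustrated with a small NumPy sketch. This is not the paper's derivation or experimental setup: the logit values, the `scale` factors standing in for growing logit magnitude, and all function names are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the output is unchanged.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def softmax_jacobian(z):
    # d softmax(z)_i / d z_j = p_i * (delta_ij - p_j)
    p = softmax(z)
    return np.diag(p) - np.outer(p, p)

def jacobian_norm_at_scale(base_logits, scale):
    # Frobenius norm of the gate Jacobian when the logits are multiplied
    # by `scale` (a hypothetical stand-in for diverging router logits).
    return np.linalg.norm(softmax_jacobian(scale * base_logits))

# Hypothetical router logits for four experts.
base_logits = np.array([0.3, -0.1, 0.5, 0.0])
norms = [jacobian_norm_at_scale(base_logits, s) for s in (1.0, 10.0, 100.0)]
```

In this toy setting the Jacobian norm shrinks monotonically as the scale grows, which is the sense in which a saturated gate passes almost no gradient back to the router.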