SNECV-Muon: Energy-Aware Adaptive Orthogonalization for Blockwise Muon in Shard MoE Pretraining

Shen Jiarun, Li Xiao

Published: 28 Mar 2026, Last Modified: 28 Mar 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Muon-style optimizers improve optimization by orthogonalizing matrix-valued updates, but their distributed cost becomes significant under tensor parallelism. Existing blockwise variants reduce communication by orthogonalizing local shards independently, and periodic correction methods recover part of the lost global geometry. In sparse Mixture-of-Experts (MoE) training, however, routing creates strong cross-shard heterogeneity, making a fixed periodic schedule suboptimal. We propose SNECV-Muon, a simple adaptive controller for blockwise Muon. The method monitors the coefficient of variation of shard energies and standardizes this signal with a per-matrix exponential moving average. The resulting score acts as an online estimate of when the block-diagonal approximation is reliable and when a full global Muon update is needed. SNECV-Muon performs local orthogonalization by default, damps updates in mildly abnormal regimes, and triggers full global orthogonalization only when shard imbalance becomes statistically significant. We evaluate Adam, Dion, Muon, MuonBP, and SNECV-Muon on several variants of dense and MoE language models in Megatron-LM with tensor parallelism and expert parallelism. Under matched communication budgets, SNECV-Muon consistently improves the throughput-quality tradeoff over MuonBP, either reaching lower loss or matching loss with higher throughput. Additional analysis shows that the trigger tracks cross-shard geometric imbalance and correlates with the failure of purely local orthogonalization.
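The three-regime controller described in the abstract could be sketched roughly as follows. This is an illustrative sketch and not the authors' implementation: the class name `ShardCVController`, the EMA decay, the z-score thresholds `damp_z` and `global_z`, and the `min_std` floor are all assumptions, as is the choice of what counts as a shard's "energy" (for example, the squared Frobenius norm of its local update). The abstract does not specify any of these values.

```python
import math


class ShardCVController:
    """Hypothetical per-matrix controller; all hyperparameters are assumed."""

    def __init__(self, decay=0.9, damp_z=1.0, global_z=2.0,
                 min_std=0.05, eps=1e-8):
        self.decay = decay        # EMA decay for the running statistics (assumed)
        self.damp_z = damp_z      # z-score at which updates are damped (assumed)
        self.global_z = global_z  # z-score that triggers a global update (assumed)
        self.min_std = min_std    # floor on the running std to avoid blow-ups (assumed)
        self.eps = eps
        self.mean = None          # EMA of the coefficient of variation
        self.var = 0.0            # EMA estimate of its variance

    @staticmethod
    def coefficient_of_variation(energies, eps=1e-8):
        """Population std of the shard energies divided by their mean."""
        n = len(energies)
        mu = sum(energies) / n
        var = sum((e - mu) ** 2 for e in energies) / n
        return math.sqrt(var) / (mu + eps)

    def step(self, energies):
        """Return 'local', 'damp', or 'global' for this matrix at this step."""
        cv = self.coefficient_of_variation(energies, self.eps)
        if self.mean is None:
            # First observation: initialise the EMA and stay local.
            self.mean = cv
            return "local"

        # Standardise the current CV against the running statistics.
        std = max(math.sqrt(self.var), self.min_std)
        z = (cv - self.mean) / std

        # Update the EMA mean and variance after scoring the current step.
        delta = cv - self.mean
        alpha = 1.0 - self.decay
        self.mean += alpha * delta
        self.var = self.decay * (self.var + alpha * delta ** 2)

        if z >= self.global_z:
            return "global"   # imbalance is significant: full orthogonalization
        if z >= self.damp_z:
            return "damp"     # mildly abnormal: keep local update but damp it
        return "local"        # block-diagonal approximation assumed reliable


controller = ShardCVController()
for shard_energies in ([1, 1, 1, 1], [1.08, 0.92, 1.08, 0.92], [4, 0, 0, 0]):
    print(controller.step(shard_energies))
```

In a tensor-parallel setting, each rank would contribute one scalar energy per matrix, so the monitoring signal itself requires only a small collective compared with a full orthogonalization; how the energies are actually gathered and how damping is applied are left open here because the abstract does not describe them.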