Keywords: non-convex optimization, communication compression, error feedback
Abstract: Recent optimizers like Muon, Scion, and Gluon have pushed the frontier of large-scale deep learning by exploiting layer-wise linear minimization oracles (LMOs) over non-Euclidean norm balls, capturing neural network structure in ways traditional algorithms cannot. Yet, no principled distributed framework exists, and communication bottlenecks remain unaddressed. Existing solutions are largely heuristic and lack any theoretical support.
We introduce EF21-Muon, the first communication-efficient, non-Euclidean LMO-based optimizer with rigorous convergence guarantees. EF21-Muon supports stochastic gradients, momentum, and bidirectional compression with error feedback; when compression is disabled and specific norms are chosen, it recovers Muon/Scion, providing the first efficient distributed implementation of this powerful family. Our theory covers the non-Euclidean layer-wise smooth setting and the sharper layer-wise $(L^0,L^1)$-smooth setting, matching the best-known Euclidean rates while enabling faster convergence under suitable norms. Experiments on language modeling tasks confirm that EF21-Muon delivers significant communication savings without loss of accuracy.
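To make the abstract's ingredients concrete, below is a minimal, illustrative sketch of how an EF21-style error-feedback step can be combined with a layer-wise spectral-norm LMO update (the Muon-style step for matrix parameters). Everything here is an assumption for illustration rather than the paper's actual algorithm: `top_k_compress` stands in for any contractive compressor, downlink (server-to-worker) compression is omitted for brevity, and the spectral LMO is computed via an exact SVD instead of Newton-Schulz iterations.

```python
import numpy as np

def top_k_compress(delta, k):
    """Generic contractive compressor: keep the k largest-magnitude entries of delta."""
    flat = delta.ravel()
    out = np.zeros_like(flat)
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    out[idx] = flat[idx]
    return out.reshape(delta.shape)

def spectral_lmo(g, radius=1.0):
    """LMO over the spectral-norm ball: argmin_{||X||_op <= radius} <g, X> = -radius * U V^T."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return -radius * (u @ vt)

def ef21_muon_step(x, worker_grads, g_states, momentum, lr=0.02, beta=0.9, k=64):
    """One illustrative round: workers send compressed corrections to their gradient
    estimates (EF21-style error feedback), the server aggregates, applies momentum,
    and takes a layer-wise LMO (Muon-style) step for the matrix parameter x."""
    for i, grad in enumerate(worker_grads):
        correction = top_k_compress(grad - g_states[i], k)  # compressed uplink message
        g_states[i] = g_states[i] + correction              # local error-feedback state
    g_avg = sum(g_states) / len(g_states)                   # server-side aggregate
    momentum = beta * momentum + (1.0 - beta) * g_avg        # momentum on the aggregate
    x = x + lr * spectral_lmo(momentum)                      # non-Euclidean LMO update
    return x, g_states, momentum
```

With `top_k_compress` replaced by the identity map, the compressed corrections equal the true gradient differences, so `g_states[i]` tracks each worker's gradient exactly and the step reduces to an uncompressed Muon/Scion-style update, mirroring the recovery claim in the abstract.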
Submission Number: 103