On the Width Scaling of Neural Optimizers Under Matrix Operator Norms I: Row/Column Normalization and Hyperparameter Transfer
Abstract: A central question in modern deep learning is how to design optimizers whose behavior remains
stable as the network width 𝑤 increases. We address this question by interpreting several widely
used neural-network optimizers, including AdamW and Muon, as instances of steepest descent under
matrix operator norms. This perspective links optimizer geometry with the Lipschitz structure of
the network forward map, and enables width-independent control of both Lipschitz and smoothness
constants. However, steepest-descent rules induced by standard 𝑝 → 𝑞 operator norms lack layerwise
composability and therefore cannot provide width-independent bounds in deep architectures. We
overcome this limitation by introducing a family of mean-normalized operator norms, denoted
(𝑝, mean) → (𝑞, mean), that admit layerwise composability, yield width-independent smoothness
bounds, and give rise to practical optimizers such as rescaled AdamW, row normalization, and
column normalization. The resulting width-aware learning-rate scaling rules recover 𝜇P scaling [59]
as a special case and provide a principled mechanism for cross-width learning-rate transfer across a
broad class of optimizers. We further show that Muon can suffer an O(√𝑤) worst-case growth in
the smoothness constant, whereas a new family of row-normalized optimizers we propose achieves
width-independent smoothness guarantees. Based on these observations, we propose MOGA (Matrix
Operator Geometry Aware), a width-aware optimizer built solely on row/column-wise normalization
that enables stable learning-rate transfer across model widths. Large-scale pre-training on GPT-2
and LLaMA shows that MOGA, especially with row normalization, is competitive with Muon while
being notably faster in large-token and low-loss regimes.
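As one concrete reading of the row-normalization idea, here is a minimal PyTorch sketch, assuming (i) each row of the gradient matrix is rescaled to unit ℓ2 norm, and (ii) a 𝜇P-style 1/fan_in learning-rate scale for matrix parameters. The function names, the omission of momentum, and the exact width exponent are illustrative assumptions; the abstract does not specify the MOGA update rule.

```python
import torch

def row_normalize(g: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Rescale each row of the update matrix to (approximately) unit l2 norm.
    # Row norms have shape (fan_out, 1) for a weight of shape (fan_out, fan_in);
    # column normalization would instead normalize along dim=0.
    return g / (g.norm(dim=1, keepdim=True) + eps)

@torch.no_grad()
def moga_like_step(weight: torch.Tensor, grad: torch.Tensor, lr: float) -> None:
    # One hypothetical width-aware step for a matrix parameter.
    # The 1/fan_in factor mirrors muP-style learning-rate transfer for hidden
    # matrices; whether MOGA uses exactly this exponent is an assumption here.
    _, fan_in = weight.shape
    weight.add_(row_normalize(grad), alpha=-lr / fan_in)
```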