TL;DR: Modular dualization maps gradients to weight space in neural nets, enabling fast and scalable training algorithms automatically optimized for different architectures. Successfully used by the community to speed up NanoGPT training.
Abstract: An old idea in optimization theory says that since the gradient is a dual vector it may not be subtracted from the weights without first being mapped to the primal space where the weights reside. We take this idea seriously in this paper and construct such a duality map for general neural networks. Our map, which we call modular dualization, forms a unifying theoretical basis for training algorithms that are a) fast and b) scalable. Modular dualization involves first assigning operator norms to layers based on the semantics of each layer, and then using these layerwise norms to recursively induce a duality map on the weight space of the full neural architecture. We derive GPU-friendly algorithms for dualizing Embed, Linear and Conv2D layers—the latter two methods are based on a Newton-Schulz iteration. We conclude with small experiments demonstrating the speed, scalability and novel numerical properties of duality-based optimizers. Our methods were used in the Muon optimizer, which recently set speed records for training NanoGPT and was scaled up to a 1.5 billion parameter transformer.
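For concreteness, below is a minimal sketch of what a Newton-Schulz-based duality map for a Linear layer could look like, following the description in the abstract. It uses a simple cubic Newton-Schulz iteration to approximately orthogonalize the gradient (replacing its nonzero singular values with 1, without an explicit SVD); the function name, coefficients, and step count are illustrative assumptions, not the paper's exact implementation, which is available at the code link below.

```python
import numpy as np

def dualize_linear(grad, steps=10, eps=1e-7):
    """Approximately map a Linear layer's gradient to its dual under the
    spectral norm by orthogonalizing it with a Newton-Schulz iteration."""
    # Rescale so all singular values lie in (0, sqrt(3)), the basin of
    # convergence of the cubic iteration below.
    X = grad / (np.linalg.norm(grad) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # work with the wide orientation so X @ X.T is the smaller Gram matrix
    for _ in range(steps):
        # Cubic Newton-Schulz step: pushes every nonzero singular value toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

# Usage: replace a raw gradient with its (approximately) orthogonalized dual.
g = np.random.randn(256, 512)
update = dualize_linear(g)
```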
Lay Summary: **Problem:** Vanilla gradient descent has a fundamental issue: it directly subtracts gradients from weights without considering that different parts of the network may have very different geometric properties. This is like trying to subtract apples from oranges, and the result is that gradient descent can be slow and scale poorly across different neural network sizes.
**Solution:** We developed "modular dualization", a systematic way to construct the right conversion map (called a duality map) for the gradients of any neural network architecture, so that the converted gradients can be subtracted from the weights in a sensible way. Our method works in three steps: first, we assign appropriate geometric measures to individual layers based on what each layer actually does; second, we create conversion rules for each layer type; third, we combine these layerwise rules into a single conversion map for the entire network. We also created efficient GPU algorithms that compute these conversions quickly for the most common layer types, such as linear and convolutional layers.
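The third step, combining layerwise rules into a single network-wide map, might be pictured as in the hypothetical sketch below, which reuses `dualize_linear` from the earlier sketch. The layer names and the Embed rule shown here are assumptions for illustration only, not the rules derived in the paper.

```python
import numpy as np  # reuses dualize_linear from the sketch above

def dualize_network(grads, layer_types):
    """Apply an illustrative layerwise duality map to each layer's gradient."""
    rules = {
        "Linear": dualize_linear,  # spectral-norm dual via Newton-Schulz (sketch above)
        # Illustrative Embed rule: normalize each embedding row of the gradient.
        "Embed": lambda g: g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-7),
    }
    return [rules[t](g) for g, t in zip(grads, layer_types)]

# Usage: dualize gradients layer by layer before the weight update.
grads = [np.random.randn(64, 64), np.random.randn(1000, 64)]
updates = dualize_network(grads, ["Linear", "Embed"])
```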
**Impact:** This approach unifies two important but seemingly different optimization methods, maximal update parameterization (μP) for scalable training and Shampoo for fast training, showing that both are approximations of a single theoretical idea. Our methods have already led to significant speedups in practice: the Muon optimizer based on our theory recently set new speed records for training language models, scaling from small networks to a 1.5 billion parameter transformer. Beyond speed, our approach reveals novel numerical properties of neural network training, such as allowing weights to move much further from their initial values than traditional methods allow, challenging conventional wisdom about how neural networks learn.
Link To Code: https://modula.systems
Primary Area: Deep Learning->Theory
Keywords: modular, duality, Newton-Schulz
Submission Number: 11992