Keywords: Optimization, Deep Learning, Muon, Generalized Smoothness
TL;DR: A novel layer-wise smoothness assumption provides a more realistic theoretical basis for LMO-based optimizers, leading to adaptive stepsizes.
Abstract: Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon and Scion. After over a decade of Adam's dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based framework called Gluon, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of Muon and Scion, and leads to state-of-the-art convergence guarantees. Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
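To make the layer-wise LMO idea concrete, below is a minimal sketch (not the paper's code) of one Muon/Scion-style update in which the LMO over a spectral-norm ball is applied separately to each weight matrix; the function names, the exact-SVD solver (in place of a Newton-Schulz iteration), and the per-layer radii are illustrative assumptions.

```python
# Minimal sketch of a layer-wise LMO step under a spectral-norm constraint.
# Assumptions: exact SVD instead of an iterative orthogonalization, and a
# hypothetical per-layer radius list acting as the stepsizes.
import numpy as np


def spectral_lmo(grad: np.ndarray, radius: float) -> np.ndarray:
    """Solve argmin_{||D||_2 <= radius} <grad, D>: the LMO over the spectral-norm ball."""
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    # The minimizer is -radius * U V^T, whose inner product with grad is
    # -radius times the nuclear norm of grad (the dual of the spectral norm).
    return -radius * U @ Vt


def layerwise_lmo_step(weights: list[np.ndarray],
                       grads: list[np.ndarray],
                       radii: list[float]) -> list[np.ndarray]:
    """Apply the LMO independently per layer; each matrix gets its own radius."""
    return [W + spectral_lmo(G, r) for W, G, r in zip(weights, grads, radii)]
```

In this sketch the per-layer radii play the role of the adaptive stepsizes mentioned in the TL;DR: because the LMO is solved layer by layer, each weight matrix can use a constraint radius tuned to its own (generalized) smoothness.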
Primary Area: optimization
Submission Number: 16183