From Muon to Gluon: Bridging Theory and Practice of LMO-based Optimizers for LLMs

ICLR 2026 Conference Submission 16183 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Optimization, Deep Learning, Muon, Generalized Smoothness
TL;DR: A novel layer-wise smoothness assumption provides a more realistic theoretical basis for LMO-based optimizers, leading to adaptive stepsizes.
Abstract: Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as 𝖬𝗎𝗈𝗇 and 𝖲𝖼𝗂𝗈𝗇. After over a decade of 𝖠𝖽𝖺𝗆's dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and, most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise application of the LMO that these optimizers use in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both issues, we propose a new LMO-based framework called 𝖦𝗅𝗎𝗈𝗇, which captures previously analyzed methods as special cases, and introduce a refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of 𝖬𝗎𝗈𝗇 and 𝖲𝖼𝗂𝗈𝗇, and leads to state-of-the-art convergence guarantees. Our experiments with NanoGPT and a CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
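
To make the layer-wise LMO idea mentioned in the abstract concrete, here is a minimal illustrative sketch in Python (not the authors' implementation): each matrix-shaped layer gets its own LMO over a spectral-norm ball, solved in closed form via an SVD, with its own per-layer radius. The names `spectral_lmo`, `layerwise_lmo_step`, and the `radii` schedule are hypothetical; in practice 𝖬𝗎𝗈𝗇 orthogonalizes a momentum buffer with a Newton–Schulz iteration rather than an exact SVD.

```python
import torch

def spectral_lmo(grad: torch.Tensor, radius: float) -> torch.Tensor:
    """LMO over a spectral-norm ball: argmin_{||D||_2 <= radius} <grad, D> = -radius * U V^T."""
    U, _, Vh = torch.linalg.svd(grad, full_matrices=False)
    return -radius * (U @ Vh)

def layerwise_lmo_step(params, radii):
    """Apply one LMO-based update per layer, each with its own (possibly adaptive) radius.

    `radii` is a hypothetical per-layer stepsize/radius schedule; the point is only
    that every layer is treated with its own LMO and its own scale.
    """
    for p, r in zip(params, radii):
        if p.grad is None:
            continue
        if p.grad.ndim == 2:
            # Matrix layers: spectral-norm geometry (Muon/Scion-style).
            update = spectral_lmo(p.grad, r)
        else:
            # Vectors/biases: a sign-like LMO (e.g., an l_inf ball) as a simple stand-in.
            update = -r * torch.sign(p.grad)
        p.data.add_(update)
```

The per-layer radii are where a layer-wise smoothness model can enter: layers with different smoothness constants can be assigned different stepsizes rather than one global, worst-case value.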
Primary Area: optimization
Submission Number: 16183