Principal Spectral Regularization Makes Momentum Surpass Adam for LLM Training

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Spectral regularization, Optimization, Large Language Model (LLM)
Abstract: Adam has been the most popular optimizer for training deep neural networks for nearly a decade. Recently, Muon, known for its momentum orthogonalization property, has emerged as a strong alternative for training large language models (LLMs). However, is orthogonalization over the whole learning space really necessary, especially given the high computational cost of the Newton-Schulz iteration in Muon? To the best of our knowledge, we are the first to report that momentum SGD with light spectral regularization on only a few dominant dimensions can surprisingly surpass Adam. This work makes three main contributions. First, from spectral visualizations of LLM training dynamics and of optimization on the Styblinski-Tang function, we observe that full orthogonalization of the momentum matrix can be suboptimal in some cases. Second, we propose a novel principal spectral regularization (PSR) method that selectively and efficiently penalizes only the dominant spectral components. Third, we show that PSR enables SGD with momentum to surpass Adam in LLM pretraining.
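Since the paper's code is not included on this page, the PyTorch sketch below illustrates one plausible reading of a PSR-style step: compute a cheap low-rank SVD of the momentum buffer and shrink only its top-k singular directions before applying the update, leaving the rest of the spectrum untouched. The function name psr_momentum_step and the hyperparameters k and shrink are illustrative assumptions, not values from the paper.

```python
import torch

def psr_momentum_step(param, momentum, grad, lr=0.02, beta=0.95,
                      k=4, shrink=0.5):
    """One heavy-ball momentum step with a hypothetical PSR-style
    correction: only the top-k singular directions of the momentum
    matrix are penalized. k and shrink are illustrative, not from
    the paper."""
    # Standard heavy-ball momentum accumulation.
    momentum.mul_(beta).add_(grad)
    # Rank-k randomized SVD; far cheaper than the full Newton-Schulz
    # orthogonalization used by Muon.
    U, S, V = torch.svd_lowrank(momentum, q=k)
    # Shrink only the dominant singular values of the applied update.
    correction = U @ torch.diag(S * shrink) @ V.T
    param.add_(momentum - correction, alpha=-lr)
    return param, momentum

# Usage on a single weight matrix (assumed shapes for illustration):
W = torch.randn(512, 512)
m = torch.zeros_like(W)
g = torch.randn_like(W)  # stand-in for a real gradient
W, m = psr_momentum_step(W, m, g)
```

Note the design contrast this sketch is meant to convey: Muon reshapes the entire spectrum of the momentum matrix toward orthogonality, whereas a PSR-style update touches only a handful of dominant directions, which is where the claimed computational savings would come from.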
Primary Area: optimization
Submission Number: 18613