Track: tiny paper (up to 4 pages)
Keywords: Spectral Conditioning, Optimization Geometry, Transformer Learning
Abstract: Training large Transformers is often bottlenecked by early optimization dynamics rather than model capacity. In this work, we identify a concrete spectral pathology that emerges early in training and show that it can be mitigated with minimal, targeted intervention. Specifically, we show that a small number of weight matrices, namely the attention output projection and the MLP down-projection, develop severe spectral spikiness, characterized by rapid growth of the top singular value relative to the bulk spectrum. This induces ill-conditioning, distorts gradient flow, and slows convergence in language model pretraining. We demonstrate that geometry-aware optimization (Muon) suppresses this pathology by implicitly controlling matrix spectra, whereas standard AdamW lacks such regulation. Crucially, we show that targeted, early-phase spectral stabilization applied only to these matrices further improves Muon’s performance and also benefits AdamW. Our results identify spectral conditioning of specific layers as a central optimization bottleneck in Transformers and show that minimal, localized geometric control is sufficient to substantially accelerate learning.
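To make the notion of spectral spikiness concrete, the sketch below measures how far the top singular value of a weight matrix sits above the rest of its spectrum. The abstract does not state the exact metric used in the paper, so the ratio of the largest singular value to the median singular value is an assumption chosen purely for illustration, and the function name `spectral_spikiness` is hypothetical.

```python
import numpy as np


def spectral_spikiness(weight: np.ndarray) -> float:
    """Ratio of the top singular value to the median singular value.

    Assumed proxy for "spikiness"; not necessarily the paper's metric.
    """
    # compute_uv=False returns only the singular values, sorted descending.
    singular_values = np.linalg.svd(weight, compute_uv=False)
    return float(singular_values[0] / np.median(singular_values))


rng = np.random.default_rng(0)
d = 64

# A well-conditioned "bulk" matrix: i.i.d. Gaussian entries scaled by 1/sqrt(d).
bulk = rng.standard_normal((d, d)) / np.sqrt(d)

# Add a strong rank-one component to mimic a single dominant direction.
u = rng.standard_normal((d, 1))
u /= np.linalg.norm(u)
v = rng.standard_normal((d, 1))
v /= np.linalg.norm(v)
spiked = bulk + 20.0 * (u @ v.T)

print("bulk spikiness:  ", spectral_spikiness(bulk))
print("spiked spikiness:", spectral_spikiness(spiked))
```

Under this proxy, a matrix with a flat spectrum (for example, the identity) scores 1, and the score grows as one singular value separates from the bulk. In practice such a diagnostic would be tracked per layer over training steps, for instance on the attention output projection and the MLP down-projection.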
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 73