Abstract: Gradient signals in LLM training are highly anisotropic: recurrent linguistic structure concentrates energy into a small set of dominant spectral directions, while context-specific information resides in a long tail. We show that this spike–tail separation persists throughout training, with the spike occupying only about 1.5% of directions yet dominating optimizer statistics. This dominance suppresses tail learning by contracting tail updates through second-moment normalization and tightening the globally stable learning-rate bound.
Motivated by this analysis, we propose Spectra, a spike-aware optimizer that suppresses the dominant low-rank spike subspace without amplifying the noise-sensitive spectral tail. Spectra tracks the spike subspace via cached, warm-started power iteration and applies low-rank spectral shaping with negligible overhead and substantially reduced optimizer-state memory.
Across Qwen3-0.6B trained on 100B tokens and LLaMA3-8B trained on 50B tokens, Spectra achieves the lowest final validation loss, improving average downstream accuracy by 1.41 and 0.89 points over AdamW and Muon on Qwen3-0.6B, and by 1.62 and 0.66 points on LLaMA3-8B. For wall-clock convergence, Spectra reaches matched loss targets up to 1.31×, 1.34×, and 1.24× faster than AdamW on Qwen3-0.6B, Qwen3-2B-A0.8B, and Qwen3-8B, respectively; its speedup over Muon grows as model scale increases from 0.6B to 8B. For computational efficiency, Spectra is 5.1× faster than Muon in optimizer processing time, cuts optimizer-state memory by 49.25%, and achieves the lowest measured end-to-end per-step runtime. Spectra's Megatron integration is available at https://github.com/kimmichtank/spectra.
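To make the idea of cached, warm-started power iteration concrete, the following is a minimal rank-1 sketch in plain Python. It is not the Spectra implementation: the function names, the `keep` factor, and the rule of shrinking the spike to a fixed fraction of its singular value are all assumptions chosen for illustration, and the abstract does not specify the actual shaping rule or spike rank. The sketch only shows how a direction estimated at one step can seed the next step's iteration, and how the dominant component can be damped while the remainder of the matrix is left unchanged.

```python
import math
import random


def matvec(A, x):
    """Compute A @ x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, y):
    """Compute A^T @ y for a list-of-rows matrix."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]


def normalize(x):
    """Return (unit vector, norm); a zero vector is returned unchanged."""
    n = math.sqrt(sum(v * v for v in x))
    if n == 0.0:
        return list(x), 0.0
    return [v / n for v in x], n


def top_singular(G, v0, iters=2):
    """A few power-iteration steps on G^T G, warm-started from v0."""
    v, _ = normalize(v0)
    for _ in range(iters):
        u, _ = normalize(matvec(G, v))
        v, _ = normalize(rmatvec(G, u))
    u, sigma = normalize(matvec(G, v))  # G v = sigma * u by construction
    return u, sigma, v


def damp_spike(G, u, sigma, v, keep=0.1):
    """Hypothetical shaping rule: shrink the rank-1 spike to keep * sigma."""
    scale = (1.0 - keep) * sigma
    return [[G[i][j] - scale * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


# Synthetic "gradient": one strong rank-1 spike plus small noise (the tail).
random.seed(0)
m, n = 6, 5
a, _ = normalize([random.gauss(0, 1) for _ in range(m)])
b, _ = normalize([random.gauss(0, 1) for _ in range(n)])

v_cache = [1.0] * n  # cold start; later steps reuse the previous estimate
for step in range(3):
    G = [[10.0 * a[i] * b[j] + 0.01 * random.gauss(0, 1) for j in range(n)]
         for i in range(m)]
    u, sigma, v_cache = top_singular(G, v_cache)
    shaped = damp_spike(G, u, sigma, v_cache)
```

Because the spike direction changes slowly between steps in this toy setup, two iterations per step are enough once the cache is warm. A rank-k variant would track an orthonormal basis instead of a single vector.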
Lay Summary: Training large language models is expensive, and the way a model updates itself during training strongly affects both cost and final quality. We find that these updates are often highly unbalanced: a small number of dominant directions capture repeated common language patterns, while many weaker directions carry more specific, long-tail information. Existing optimizers can let the dominant directions control the training process, which may slow down learning in the weaker but important directions.
We propose Spectra, an optimizer that identifies these dominant directions and gently reduces their influence, without over-amplifying the noisier long-tail directions. Instead of processing the full update structure, Spectra tracks only a small low-dimensional part of it, making the method efficient enough for large-scale language model training.
Across several language model settings, Spectra reaches lower training loss, improves downstream accuracy, and converges faster than widely used optimizers such as AdamW and Muon. This suggests that looking at training updates as structured signals, rather than independent numbers, can make large language models cheaper and more effective to train.
Primary Area: Deep Learning->Large Language Models
Keywords: pretraining, llm, anisotropy
Submission Number: 21047