Signal Frequency Imbalance and Ill-Conditioning

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
Readers: Everyone
License: CC BY 4.0
Keywords: Neural Network Optimization, Batch Size Scaling
TL;DR: Frequency imbalance from gradient clusters creates ill-conditioning at large batch sizes, where minibatch averaging suppresses rare directions; Adam and Muon mitigate this effect better than SGD.
Abstract: The source of the ill-conditioning addressed by Adam- and Muon-like optimizers remains poorly understood, making it unclear when and why they outperform SGD. We introduce a generalization of signal frequency imbalance that captures effective low-rank structure arising from correlations between weight-space directions and data subsets. Empirically, we show that this structure appears in the inner layers of language models as semantically meaningful gradient clusters, helping explain why Muon can outperform Adam. On a simplified problem, we show that signal frequency imbalance induces ill-conditioning only at large batch sizes, where minibatch averaging suppresses progress along rare directions, explaining why Adam and Muon outperform SGD only in this regime.
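Illustration: the batch-size effect described in the abstract can be sketched on a toy problem. This is not the paper's simplified problem, only an assumed stand-in: each sample activates one of two coordinates with probabilities `probs` (a hypothetical 99:1 imbalance) and has loss `0.5 * w[k]**2`. The helper names `minibatch_gradient`, `sgd_step`, and `sign_step` are made up for this sketch, and the sign update is only a crude proxy for the per-coordinate normalization in Adam-like methods.

```python
import random

def minibatch_gradient(w, probs, batch_size, rng):
    """Average gradient over a minibatch. Each sample draws a coordinate k
    from probs and contributes the gradient of 0.5 * w[k]**2 along k."""
    picks = rng.choices(range(len(w)), weights=probs, k=batch_size)
    grad = [0.0] * len(w)
    for k in picks:
        grad[k] += w[k] / batch_size
    return grad

def sgd_step(grad, lr):
    # Plain SGD: step size along each coordinate scales with the averaged gradient.
    return [lr * g for g in grad]

def sign_step(grad, lr):
    # Assumed proxy for Adam-like normalization: same step size on every
    # coordinate that received any gradient signal.
    return [lr * ((g > 0) - (g < 0)) for g in grad]

rng = random.Random(0)
probs = [0.99, 0.01]   # frequent vs. rare direction (assumed values)
w = [1.0, 1.0]

large = minibatch_gradient(w, probs, 10_000, rng)  # averaging shrinks the rare coordinate
small = minibatch_gradient(w, probs, 1, rng)       # one direction at full magnitude

sgd_large = sgd_step(large, lr=1.0)
sign_large = sign_step(large, lr=1.0)

print("large-batch gradient:", large)
print("batch-size-1 gradient:", small)
print("SGD step ratio (frequent / rare):", sgd_large[0] / sgd_large[1])
print("sign step:", sign_large)
```

In this sketch the large-batch gradient along the rare coordinate is scaled down by roughly its frequency, so a single SGD learning rate that is stable for the frequent coordinate makes slow progress on the rare one. At batch size 1 the sampled coordinate receives its gradient at full magnitude, and the sign update takes equal-sized steps on both coordinates at large batch.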
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 117