Pre-Normalization Momentum Governs Optimizer-Induced Rank Bias
Keywords: Optimization Dynamics, Implicit Bias, Adaptive Optimizers, Low-Rank Structure, Spectral Bias, Deep Matrix Factorization
Abstract: Adaptive optimizers induce strikingly different implicit rank biases: Adam reliably recovers low-rank solutions in deep matrix factorization, whereas RMSProp catastrophically overshoots despite using the same adaptive normalization mechanism $g/\sqrt{v}$. We identify the responsible component as Adam's \emph{pre-normalization momentum filter}, the $\beta_1$ exponential moving average applied to raw gradients before adaptive normalization. We show that adaptive normalization removes the depth-dependent suppression underlying incremental rank learning, exposing adaptive methods to trailing-spectrum inflation. Under stationary trailing-direction noise, Adam's pre-normalization filter scales the update variance by a factor of $(1-\beta_1)/(1+\beta_1)$, so the update standard deviation falls to $\approx 0.23$ of its unfiltered value at $\beta_1=0.9$. Controlled falsification experiments isolate this mechanism directly: removing $\beta_1$ breaks Adam's low-rank recovery, while post-normalization momentum fails to reproduce it. Sweeping $\beta_1$ reveals a sharp threshold, with stable low-rank recovery emerging only for $\beta_1 \gtrsim 0.7$. Finally, the mechanism transfers quantitatively to TinyLlama fine-tuning, where the observed RMSProp-to-AdamW update-magnitude ratio closely matches the theoretical prediction. Our results identify pre-normalization temporal filtering as a previously uncharacterized source of optimizer-induced spectral bias.
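The variance factor $(1-\beta_1)/(1+\beta_1)$ quoted in the abstract can be checked numerically. The following is a minimal sketch, not the authors' code. It assumes i.i.d. unit-variance Gaussian noise as a stand-in for stationary trailing-direction noise, uses the filter $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$ without Adam's bias correction (whose effect vanishes at stationarity), and all function names are hypothetical.

```python
import math
import random


def ema_variance_ratio(beta1: float) -> float:
    """Stationary variance of an EMA of i.i.d. noise, relative to the input variance."""
    return (1.0 - beta1) / (1.0 + beta1)


def simulate_ema_variance(beta1: float, steps: int = 200_000, seed: int = 0) -> float:
    """Empirical variance of m_t = beta1*m_{t-1} + (1-beta1)*g_t with g_t ~ N(0, 1)."""
    rng = random.Random(seed)
    m = 0.0
    burn_in = 1_000  # discard the transient before the filter is stationary
    total = 0.0
    total_sq = 0.0
    count = 0
    for t in range(steps):
        g = rng.gauss(0.0, 1.0)  # assumed noise model: i.i.d. standard normal
        m = beta1 * m + (1.0 - beta1) * g
        if t >= burn_in:
            total += m
            total_sq += m * m
            count += 1
    mean = total / count
    return total_sq / count - mean * mean


beta1 = 0.9
predicted_var = ema_variance_ratio(beta1)
predicted_std = math.sqrt(predicted_var)
empirical_var = simulate_ema_variance(beta1)

print(f"predicted variance ratio: {predicted_var:.4f}")
print(f"predicted std ratio:      {predicted_std:.4f}")
print(f"empirical variance ratio: {empirical_var:.4f}")
```

Because the input noise has unit variance, the empirical variance of the filtered sequence is directly comparable to the predicted ratio. Temporally correlated gradient noise would change the factor, so this check applies only under the i.i.d. assumption stated above.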
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 73