Adaptive Preconditioners Trigger Loss Spikes in Adam

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 regular | Readers: Everyone | License: CC BY 4.0
TL;DR: We provide a mechanistic analysis of the loss spike phenomenon in Adam training.
Abstract: Loss spikes commonly emerge during neural network training with the Adam optimizer across diverse architectures and scales, yet their underlying mechanism remains elusive. While previous explanations attribute these spikes to sharper loss landscapes at lower loss, we show that landscape geometry alone is insufficient to explain them. In this work, we pinpoint the root cause in the internal dynamics of Adam's second moment estimator. We identify a critical "decoupling" mechanism in which the adaptive preconditioner $v_t$ fails to track the instantaneous squared gradients $g_t^2$, so that the adaptive mechanism effectively fails. This decoupling allows the preconditioner to decay autonomously despite rising gradients, which pushes the maximum eigenvalue of the preconditioned Hessian beyond the stability threshold $2/\eta$ for sustained periods and manifests as dramatic loss spikes. Through a quadratic approximation analysis, we theoretically and experimentally characterize five distinct stages of spike evolution and propose a predictor for anticipating spikes based on gradient-directional curvature. We empirically find that the proposed loss spike mechanism, although derived from simplified models, generalizes well to practical scenarios ranging from small neural networks to large-scale Transformers.
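The quantities named in the abstract can be tracked on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it assumes a one-dimensional quadratic loss $L(x) = \frac{1}{2} h x^2$ and Adam with $\beta_1 = 0$ (no momentum), so that the preconditioned curvature $h / (\sqrt{\hat{v}_t} + \epsilon)$ can be compared directly with the threshold $2/\eta$. The function name `simulate_adam_quadratic`, the hyperparameter values, and the record fields are illustrative assumptions.

```python
import math


def simulate_adam_quadratic(h=1.0, x0=1.0, lr=0.1, beta2=0.99, eps=1e-8, steps=2000):
    """Run Adam with beta1 = 0 on L(x) = 0.5 * h * x**2 and log per-step diagnostics.

    Assumption: beta1 = 0 is chosen only to keep the 2 / lr threshold from the
    abstract directly applicable; it is not a setting taken from the paper.
    """
    x, v = x0, 0.0
    threshold = 2.0 / lr  # stability threshold 2 / eta
    history = []
    for t in range(1, steps + 1):
        g = h * x                                  # gradient of the quadratic
        v = beta2 * v + (1.0 - beta2) * g * g      # second moment estimate v_t
        v_hat = v / (1.0 - beta2 ** t)             # bias-corrected v_t
        denom = math.sqrt(v_hat) + eps
        precond_curv = h / denom                   # preconditioned curvature (1-D "Hessian")
        history.append({
            "step": t,
            "loss": 0.5 * h * x * x,
            "g_sq": g * g,                         # instantaneous squared gradient g_t^2
            "v_hat": v_hat,
            "precond_curv": precond_curv,
            "threshold": threshold,
            "above_threshold": precond_curv > threshold,
        })
        x -= lr * g / denom                        # Adam update with beta1 = 0
    return history
```

Comparing `g_sq` with `v_hat` over time shows how closely the preconditioner follows the squared gradient, and `above_threshold` marks the steps at which the preconditioned curvature exceeds $2/\eta$ in this toy setting.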
Lay Summary: When training deep neural networks, the goal is to smoothly reduce the model's "loss" or error. However, a widely used training tool called the Adam optimizer frequently suffers from "loss spikes": sudden, violent increases in error that can severely disrupt the learning process. Previously, some researchers believed these spikes occurred simply because the network wandered into a sharply curved, unstable area of the loss landscape. Our research shows that this explanation is not enough on its own. We discovered a hidden mechanical flaw inside the Adam optimizer itself. Adam uses an internal memory mechanism to automatically adjust its learning pace. We found that under certain conditions, this memory stops paying attention to the actual steepness of the problem and begins to shrink on its own. Because of this "blind spot," Adam gets tricked into taking dangerously large steps. This built-up energy eventually launches the network out of its safe zone, resulting in a massive spike in loss. In this paper, we map out the five-stage life cycle of these spikes. We also introduce a new early-warning metric to predict them before they happen. Importantly, we demonstrate that this internal flaw is not just a theoretical curiosity: it triggers instabilities in settings ranging from simple toy models to the massive neural networks powering today's advanced language models.
Originally Submitted Supplementary Material: zip
Primary Area: Theory->Learning Theory
Keywords: loss spike, Adam, training instability
Originally Submitted PDF: pdf
Submission Number: 19316