DeltaMomentum: A Key-Value based Anisotropic Momentum Update via Delta Rule

Euijin Hong; Guannan Qu

Published: 29 May 2026, Last Modified: 29 May 2026. Venue: HiLD at ICML 2026 (Poster). License: CC BY 4.0

Keywords: high-dimensional learning dynamics, stochastic optimization, optimizer dynamics, momentum methods, anisotropic momentum, delta rule, implicit natural gradient, language model pretraining, scaling laws

Abstract: The first-moment buffer of most modern optimizers is an exponential moving average (EMA) of stochastic gradients with a single scalar decay, imposing the same forgetting horizon along every direction in parameter space. Yet the per-sample gradient of any linear layer factorizes as a rank-1 outer product $g_t = \delta_t x_t^\top$, exposing the input activation as a natural $\textit{key}$ and the output-side error as a natural $\textit{value}$, structure that EMA's flat matrix average discards. We propose DeltaMomentum, which interprets the buffer as an online linear associative memory of these key-value pairs and updates it via the classical delta rule. The resulting dynamics implement $\textit{anisotropic memory transport}$: directions queried frequently are forgotten quickly, while directions queried rarely are preserved over long horizons, yielding a direction-dependent memory horizon $\tau_i = 1/((1-\beta) + \eta\lambda_i)$ matched to the eigenstructure of the input-feature covariance. From this single dynamical identity we obtain: a Tikhonov-regularized Wiener fixed point equivalent to implicit input-side natural-gradient preconditioning; strictly faster tracking-error contraction along every positive-density direction under non-stationarity; and width-invariance of the delta coefficient under maximal-update parameterization. On Llama-2-style language-model pretraining at 67M and 370M parameters, DeltaAdamW reaches AdamW's terminal validation loss in up to $31.35\%$ and $19.28\%$ fewer steps, respectively; per-layer mechanistic diagnostics confirm the predicted dynamical signatures.
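The delta-rule buffer update described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `write` coefficient on the new key-value pair, and the default hyperparameters are assumptions made for illustration, and the sketch omits whatever Adam-style normalization DeltaAdamW applies. The only property it is built to reproduce is the stated horizon: with an erase term $\eta (M x) x^\top$ added to a scalar decay $\beta$, a direction with key energy $\lambda$ contracts by $\beta - \eta\lambda = 1 - 1/\tau$ per step, where $\tau = 1/((1-\beta) + \eta\lambda)$.

```python
import numpy as np

def delta_momentum_step(M, delta, x, beta=0.9, eta=0.05, write=1.0):
    """One per-sample delta-rule update of a momentum buffer M (out_dim x in_dim).

    delta: output-side error (the "value"); x: input activation (the "key").
    `write` is a hypothetical scaling of the incoming pair, not from the source.
    """
    recalled = M @ x  # value the buffer currently returns for key x
    # Scalar decay everywhere, extra erasure along x, then write the new pair.
    return beta * M - eta * np.outer(recalled, x) + write * np.outer(delta, x)

def memory_horizon(beta, eta, lam):
    """Direction-dependent horizon tau = 1 / ((1 - beta) + eta * lam)."""
    return 1.0 / ((1.0 - beta) + eta * lam)

# Demo: with no new value written, the queried direction decays faster.
M0 = np.ones((2, 3))
x = np.array([2.0, 0.0, 0.0])  # key with energy lam = ||x||^2 = 4
M1 = delta_momentum_step(M0, np.zeros(2), x)
tau_queried = memory_horizon(0.9, 0.05, 4.0)  # short horizon along x
tau_unqueried = memory_horizon(0.9, 0.05, 0.0)  # plain EMA horizon 1/(1-beta)
```

In this demo the column of the buffer aligned with the key shrinks by the factor $1 - 1/\tau$ for $\lambda = 4$, while the unqueried columns shrink only by $\beta$, which is the anisotropic forgetting the abstract describes.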

Submission Number: 96
