A unified theory of adaptive stochastic gradient descent as Bayesian filtering

Laurence Aitchison

A unified theory of adaptive stochastic gradient descent as Bayesian filtering

Laurence Aitchison

27 Sept 2018 (modified: 22 Jun 2025)ICLR 2019 Conference Blind SubmissionReaders: Everyone

Abstract: We formulate stochastic gradient descent (SGD) as a novel factorised Bayesian filtering problem, in which each parameter is inferred separately, conditioned on the corresopnding backpropagated gradient. Inference in this setting naturally gives rise to BRMSprop and BAdam: Bayesian variants of RMSprop and Adam. Remarkably, the Bayesian approach recovers many features of state-of-the-art adaptive SGD methods, including amongst others root-mean-square normalization, Nesterov acceleration and AdamW. As such, the Bayesian approach provides one explanation for the empirical effectiveness of state-of-the-art adaptive SGD algorithms. Empirically comparing BRMSprop and BAdam with naive RMSprop and Adam on MNIST, we find that Bayesian methods have the potential to considerably reduce test loss and classification error.

Keywords: SGD, Bayesian, RMSprop, Adam

TL;DR: We formulated SGD as a Bayesian filtering problem, and show that this gives rise to RMSprop, Adam, AdamW, NAG and other features of state-of-the-art adaptive methods

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/a-unified-theory-of-adaptive-stochastic/code)

33 Replies

Loading