The results in [17] are based on the following assumptions.
Assumption 1. Assume each $f_i$ is gradient Lipschitz continuous with constant $L$, and that $f$ is lower bounded by a finite constant $f^*$.
Assumption 2. Assume
\[
\sum_{i=0}^{n-1}\left\|\nabla f_{i}(x)\right\|_{2}^{2} \leq D_{1}\|\nabla f(x)\|_{2}^{2}+D_{0}.
\]
When $D_1=n$, Assumption 2 becomes the "bounded variance" assumption with constant $D_0/n$. When $D_0=0$, Assumption 2 is often called the "strong growth condition" (SGC) [20].
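To see why, here is a short calculation, assuming (as is usual for the finite-sum problem (1)) that the objective is the average $f = \frac{1}{n}\sum_{i=0}^{n-1} f_i$:
\[
\sum_{i=0}^{n-1}\left\|\nabla f_{i}(x)\right\|_{2}^{2}
=\sum_{i=0}^{n-1}\left\|\nabla f_{i}(x)-\nabla f(x)\right\|_{2}^{2}+n\left\|\nabla f(x)\right\|_{2}^{2},
\]
where the cross term vanishes because $\sum_{i}\left(\nabla f_{i}(x)-\nabla f(x)\right)=0$. Thus $D_1=n$ corresponds to bounding $\sum_{i}\|\nabla f_{i}(x)-\nabla f(x)\|_{2}^{2}$ by $D_0$, i.e., a per-sample gradient variance of at most $D_0/n$.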
One notable contribution of [17] is that their analysis does not require the bounded gradient assumption, i.e.,
\[
\left\| \nabla f(x) \right\| < C, \quad \forall x.
\]
Removing this assumption is important for two reasons. First, the bounded gradient condition does not hold in the practical applications of Adam (training deep neural networks); it fails even for the simplest quadratic loss $f(x)=x^2$. Second (and perhaps more importantly in this context), under the bounded gradient assumption the gradients cannot diverge by construction, whereas there is a counter-example showing that the gradients can indeed diverge on certain problems [17].
We restate their convergence result as follows.
Theorem 1 (rephrased from [17]). Consider the finite-sum problem (1) under Assumptions 1 and 2. When $\beta_2 \geq 1- \mathcal{O}(n^{-3.5})$, RMSProp (i.e., Adam with $\beta_1 = 0$) with stepsize $\eta_k = \eta_1/\sqrt{k}$ converges to a bounded region:
\[
\min _{k \in(1, T]} \min \left\{\left\|\nabla f (x_{k,0})\right\|_{1},\ \left\|\nabla f (x_{k,0})\right\|_{2}^{2} \sqrt{\frac{D_{1} d}{D_{0}}}\right\} \leq \mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right)+\mathcal{O}\left(\sqrt{D_{0}}\right).
\]
It is expected, both in theory and in practice, that a stochastic algorithm converges only to a bounded region rather than to a critical point. Indeed, the "convergence" of constant-stepsize SGD is in the sense of "converging to a region whose size is proportional to the noise variance". Similarly to SGD, in Theorem 1 the size of the region shrinks to zero as the noise variance, i.e., $D_0$, goes to 0.
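For instance, reading the bound in the limit $D_0 \to 0$ (approaching the strong growth condition, and assuming the hidden constants do not blow up), the second term inside the min tends to infinity while the $\mathcal{O}(\sqrt{D_0})$ term vanishes, so the bound formally reduces to
\[
\min_{k \in (1,T]} \left\|\nabla f\left(x_{k,0}\right)\right\|_{1} \leq \mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right),
\]
i.e., the best iterate approaches a critical point.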
Theorem 1 focuses on RMSProp, which is Adam with $\beta_1=0$. Motivated by Theorem 1, [17] also provide a convergence result for Adam with small enough $\beta_1$.
Theorem 2 (Theorem 4.4 in [17]). Consider the finite-sum problem (1) under Assumptions 1 and 2. When $\beta_2 \geq 1- \mathcal{O}(n^{-3.5})$ and $\beta_1 \leq \mathcal{O}(n^{-2.5})$, Adam with stepsize $\eta_k = \eta_1/\sqrt{k}$ converges to a bounded region:
\[
\min _{k \in(1, T]} \min \left\{\left\|\nabla f (x_{k,0})\right\|_{1},\ \left\|\nabla f (x_{k,0})\right\|_{2}^{2} \sqrt{\frac{D_{1} d}{D_{0}}}\right\} \leq \mathcal{O}\left(\frac{\log T}{\sqrt{T}}\right)+\mathcal{O}\left(\sqrt{D_{0}}\right).
\]
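For concreteness, below is a minimal sketch of a simplified Adam update (no bias correction, with a small `eps` for numerical stability; our illustration, not the exact pseudocode of [17]). Setting `beta1 = 0` makes the first moment collapse to the current gradient, recovering the RMSProp update covered by Theorem 1.

```python
import numpy as np

def adam_step(x, g, m, v, lr, beta1, beta2, eps=1e-8):
    """One simplified Adam step on parameters x given stochastic gradient g."""
    m = beta1 * m + (1 - beta1) * g       # first moment; equals g when beta1 = 0
    v = beta2 * v + (1 - beta2) * g ** 2  # second moment (per-coordinate)
    x = x - lr * m / (np.sqrt(v) + eps)   # coordinate-wise adaptive step
    return x, m, v
```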
Reconciling the two papers. Reddi et al. [14] proved that Adam (including RMSProp) does not converge for a large set of hyperparameters, while [17] proved that RMSProp converges for large enough $\beta_2$. One might think the two results are not contradictory simply because they cover different hyperparameter settings; this understanding is not correct, and the actual relation is more subtle. Reddi et al. [14] showed that for any $\beta_2 \in [0, 1)$ and $\beta_1 = 0$, there exists a convex problem on which RMSProp does not converge to the optimum. As a result, $(\beta_1, \beta_2) = (0, 0.99)$ is a hyperparameter combination that can cause divergence; $(\beta_1, \beta_2) = (0, 0.99999)$ can also cause divergence. In fact, no matter how close $\beta_2$ is to 1, $(\beta_1, \beta_2) = (0, \beta_2)$ can still cause divergence. So why do Shi et al. [17] claim that "large enough $\beta_2$ makes RMSProp converge"? The key lies in whether $\beta_2$ is picked before or after the problem instance. What Shi et al. [17] prove is: if $\beta_2$ is picked after the problem is given (so $\beta_2$ can be problem-dependent), then RMSProp converges. This does not contradict the counter-example of [14], which picks $\beta_2$ before seeing the problem.
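One way to make this quantifier order explicit (our paraphrase, not the exact statements of [14] or [17]):
\[
\text{[14]: }\ \forall\, \beta_2\in[0,1),\ \exists\ \text{a (convex) problem on which RMSProp with this }\beta_2\text{ diverges;}
\]
\[
\text{[17]: }\ \forall\ \text{problem satisfying Assumptions 1–2},\ \exists\ \beta_2<1\ \text{(problem-dependent) such that RMSProp with this }\beta_2\text{ converges to a bounded region.}
\]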
Based on the above discussion, we highlight two messages on the choice of $\beta_2$.

First, $\beta_2$ is by no means the first problem-dependent hyperparameter that we know of. A much more familiar example is the stepsize: when the objective function is $L$-smooth, the stepsize of GD is a problem-dependent hyperparameter, since it should be less than $2/L$. For any given stepsize $\alpha$, one can always find a problem on which GD with this stepsize diverges, but this does not mean "GD is non-convergent".

Second, if we view $\beta_2$ as a problem-dependent hyperparameter, then one can even say "RMSProp is convergent", in the sense that "RMSProp is convergent under a proper choice of a problem-dependent hyperparameter".
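Returning to the stepsize analogy, a standard one-line calculation shows why any fixed stepsize can be defeated by some problem: for $f(x)=\frac{L}{2}x^{2}$, gradient descent gives
\[
x_{k+1}=x_{k}-\alpha \nabla f\left(x_{k}\right)=(1-\alpha L)\, x_{k},
\]
which diverges whenever $|1-\alpha L|>1$, i.e., whenever $\alpha>2/L$. A stepsize $\alpha$ fixed before seeing the problem is thus defeated by any problem with $L>2/\alpha$, yet nobody concludes from this that GD is non-convergent.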
The above results of [17] take a remarkable step towards understanding Adam. Combined with the counter-example of [14], they show a phase transition from divergence to convergence as $\beta_2$ increases from 0 to 1. However, Shi et al. [17] do not close the convergence discussion for Adam: in Theorem 1 and Theorem 2, they require $\beta_1$ to be either 0 or small enough. Is this a reasonable requirement? Does the requirement on $\beta_1$ match the practical use of Adam? If not, how large is the gap? It turns out that this gap is non-negligible from multiple perspectives. We elaborate as follows.
Gap with practice. We did a simple calculation regarding Theorem 2. To ensure convergence,
Theorem 2 requires $\beta_1 \leq \mathcal{O}(n^{-2.5})$. On CIFAR-10 with sample size 50,000 and batch size 128, this means $\beta_1 \leq \mathcal{O}((50000/128)^{-2.5}) \approx 10^{-7}$. Such a tiny value of $\beta_1$ is rarely used in practice. Although there are certain scenarios where small $\beta_1$ is used (e.g., some methods for GANs and RL such as [16] and [12]), in these cases the typical choices are 0 or 0.1, rather than a tiny non-zero value like $10^{-7}$. For most applications of Adam, a larger $\beta_1$ is used. Kingma and Ba [8] claimed that $\beta_1=0.9$ is a "good default setting for the tested machine learning problems," and $\beta_1=0.9$ was later adopted as the default in PyTorch.
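A quick back-of-the-envelope check of this scaling (illustrative only: the $\mathcal{O}(\cdot)$ hides constants, so only the order of magnitude is meaningful):

```python
# Order-of-magnitude check of the beta_1 requirement in Theorem 2 on CIFAR-10.
n_samples = 50_000          # CIFAR-10 training set size
batch_size = 128
n = n_samples / batch_size  # number of mini-batches per epoch, about 390.6

beta1_bound = n ** (-2.5)   # the O(n^{-2.5}) scaling, ignoring hidden constants
print(f"n ~ {n:.1f}, beta_1 <= O(n^-2.5) ~ {beta1_bound:.1e}")  # roughly 3.3e-07
```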
Lack of a useful message on $\beta_1$. One might argue that Theorem 2 merely provides a theoretical bound on $\beta_1$ and does not have to match practical values. Indeed, the required bound on $\beta_2$ in Theorem 1 is $1 - \mathcal{O}(n^{-3.5})$, which is also larger than practical values of $\beta_2$ such as $0.999 = 1-0.001$. However, there is a major difference between the lower bound on $\beta_2$ and the upper bound on $\beta_1$: the former conveys the conceptual message that $\beta_2$ should be large enough to ensure good performance, which matches experiments, while the latter does not seem to convey any useful message.
Theoretical gap in the context of [14]. The counter-example of [14] applies to any $(\beta_1, \beta_2)$ with $\beta_1 < \sqrt{\beta_2}$. The counter-example is valid, and there is no way to prove that Adam converges on general problems for such hyperparameter combinations. Nevertheless, as argued earlier, the caveat is that it concerns problem-independent hyperparameters: Shi et al. [17] noticed that the counter-example of [14] applies to problem-independent hyperparameters, and that switching the order of picking the problem and picking the hyperparameters can lead to convergence. But this switching-order argument has only been shown for $\beta_2$ and does not necessarily apply to $\beta_1$. Even when $\beta_2$ is problem-dependent, it is not clear whether Adam converges for larger $\beta_1$.
Next, we discuss the importance of understanding Adam with large $\beta_1$.
Possible empirical benefit of understanding Adam with large $\beta_1$. (We call $\beta_1$ "large" when it is at least larger than 0.1.) Above, we discussed the theoretical, empirical, and conceptual gaps in the understanding of Adam's convergence. We now discuss one possible empirical benefit of filling in this gap: guiding practitioners to better tune the hyperparameters. At the current stage, many practitioners are stuck with the default setting and have no idea how to better tune $\beta_1$ and $\beta_2$. Shi et al. [17] provide simple guidance on tuning $\beta_2$ when $\beta_1 = 0$: start from $\beta_2 = 0.8$ and tune $\beta_2$ up until reaching the best performance (see the sketch below). Nevertheless, there is not much guidance on tuning $\beta_1$. If Adam does not solve your task well in the default setting $\beta_1=0.9$, how should you tune this hyperparameter to make it work? Should you tune it up, down, or both? When you tune $\beta_1$, should you tune $\beta_2$ as well? A more confusing phenomenon is that both large $\beta_1$ (e.g., $\beta_1 = 0.9$) and small $\beta_1$ (e.g., $\beta_1 = 0$) are used in different papers. This makes it hard to guess the proper way to tune $\beta_1$ (together with $\beta_2$).
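Here is a minimal sketch of what this $\beta_2$-tuning recipe could look like in PyTorch, assuming hypothetical `build_model` and `train_and_eval` helpers from your own pipeline (our illustration, not code from [17]):

```python
import torch

def sweep_beta2(build_model, train_and_eval,
                beta2_grid=(0.8, 0.9, 0.99, 0.999, 0.9999)):
    """Fix beta_1 = 0 (RMSProp-like Adam) and sweep beta_2 upward from 0.8."""
    results = {}
    for beta2 in beta2_grid:
        model = build_model()  # hypothetical user-supplied model factory
        optimizer = torch.optim.Adam(model.parameters(),
                                     lr=1e-3, betas=(0.0, beta2))
        results[beta2] = train_and_eval(model, optimizer)  # e.g., validation accuracy
    return results
```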
How difficult could it be to incorporate large $\beta_1$ into the convergence analysis? We are inclined to believe it is not easy. Momentum carries a heavy amount of history information, which dramatically distorts the trajectory of the iterates. Technically speaking, Shi et al. [17] treat momentum as a pure error deviating from the gradient direction. Following this proof idea, the error can only be controlled when $\beta_1$ is close enough to 0. To cover large $\beta_1$ in the convergence analysis, one needs to handle momentum from a different perspective.
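To illustrate the "momentum as error" viewpoint (our own illustrative decomposition, not the exact argument in [17]), write the first-moment recursion as
\[
m_k = \beta_1 m_{k-1} + (1-\beta_1)\, g_k = g_k + \underbrace{\beta_1\left(m_{k-1}-g_k\right)}_{\text{deviation from the gradient}},
\]
so the update direction is the current stochastic gradient plus a deviation whose size is proportional to $\beta_1$. This deviation is easy to control only when $\beta_1$ is close to 0, which is consistent with the small-$\beta_1$ requirement in Theorem 2.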
In this blog post, we briefly reviewed the non-convergence results of [14] and the convergence results of [17]. These results take remarkable steps towards a better understanding of Adam. Meanwhile, they also expose many new questions that have not yet been discussed. Compared with its practical success, the current theoretical understanding of Adam still lags behind.