Generalization Lower Bounds for GD and SGD in Smooth Stochastic Convex Optimization
Abstract: This work studies the generalization error of gradient methods. More specifically, we focus on how the number of training steps $T$ and the step-size $\eta$ affect generalization in smooth stochastic convex optimization (SCO) problems. Recent works show that in some cases longer training can hurt generalization. Our work reexamines this question for smooth SCO and finds that the conclusion is case-dependent. In particular, we first study SCO problems where the loss is \emph{realizable}, i.e. a single optimal solution minimizes the loss on every data point. We provide excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD) and find that longer training may not hurt generalization. In the short-training regime $\eta T = O(n)$ (where $n$ is the sample size), our lower bounds tightly match and certify the respective upper bounds. However, in the long-training regime $\eta T = \Omega(n)$, our analysis reveals a gap between the lower and upper bounds, leaving open whether longer training hurts generalization for realizable objectives. We conjecture that the gap can be closed by improving the upper bounds, and we support this conjecture with analyses of two special instances. Moreover, beyond the realizable setup, we also provide the first tight excess risk lower bounds for GD and SGD in the general non-realizable smooth SCO setting, suggesting that existing stability analyses are tight in their step-size and iteration dependence, and that overfitting provably happens when there is no interpolating minimum.
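For readers unfamiliar with the setting, the following is a minimal sketch of the quantities the abstract refers to; the notation is illustrative and not taken from the paper. Given a data distribution $\mathcal{D}$, a smooth convex loss $f(w; z)$, and a sample $S = \{z_1, \dots, z_n\}$, the population risk, empirical risk, and excess risk of an output $w_T$ are
$$F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w; z)], \qquad \hat{F}(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; z_i), \qquad F(w_T) - \min_{w} F(w).$$
GD and SGD with step-size $\eta$, run for $T$ steps on the empirical risk, update
$$w_{t+1} = w_t - \eta \nabla \hat{F}(w_t) \ \ \text{(GD)}, \qquad w_{t+1} = w_t - \eta \nabla f(w_t; z_{i_t}) \ \ \text{(SGD, with $i_t$ drawn from $\{1,\dots,n\}$)}.$$
The \emph{realizable} (interpolation) condition means that a single $w^\star$ minimizes $f(\,\cdot\,; z)$ for every $z$ in the support of $\mathcal{D}$.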
Submission Number: 125