Abstract: A longstanding debate surrounds the related hypotheses that low-curvature minima generalize
better, and that stochastic gradient descent (SGD) discourages curvature. We offer a more complete
and nuanced view in support of both hypotheses. First, we show that curvature harms test
performance through two new mechanisms, the shift-curvature and bias-curvature, in addition to
a known parameter-covariance mechanism. The shift refers to the difference between train and test
local minima, and the bias and covariance are those of the parameter distribution. These three
curvature-mediated contributions to test performance are reparametrization-invariant even
though curvature itself is not. Although the shift is unknown at training time, the shift-curvature,
like the other mechanisms, can still be mitigated by minimizing overall curvature. Second, we
derive a new, explicit SGD steady-state distribution showing that SGD optimizes an effective
potential related to but different from train loss, and that SGD noise mediates a trade-off between
low-loss versus low-curvature regions of this effective potential. Third, combining our test
performance analysis with the SGD steady state shows that for small SGD noise, the shift-curvature
is the dominant mechanism of the three. Our experiments demonstrate the significant impact of
shift-curvature on test loss, and further explore the relationship between SGD noise and curvature.
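To make the shift-curvature mechanism concrete, consider the simple second-order picture: if the test loss is approximated as quadratic around its own local minimum, then evaluating it at the shifted train minimum incurs an excess of the form ½ shift<sup>T</sup> H shift, which grows with curvature for a fixed shift. The sketch below illustrates only this quadratic intuition, not the paper's exact decomposition; the names `theta_test`, `theta_train`, and `H_test` are illustrative placeholders.

```python
# Illustrative sketch (quadratic assumption, not the paper's exact analysis):
# with the same train-test shift, higher-curvature minima incur a larger
# excess test loss of the form 0.5 * shift^T H shift.
import numpy as np

rng = np.random.default_rng(0)

def test_loss(theta, theta_test, H_test):
    """Quadratic model of the test loss around its local minimum."""
    d = theta - theta_test
    return 0.5 * d @ H_test @ d

dim = 5
shift = 0.1 * rng.standard_normal(dim)   # train-test shift (unknown at training time)
theta_test = rng.standard_normal(dim)    # test local minimum
theta_train = theta_test + shift         # nearby train local minimum

for scale in (1.0, 10.0):                # low- vs. high-curvature minima
    H_test = scale * np.diag(rng.uniform(0.5, 1.5, dim))
    excess = test_loss(theta_train, theta_test, H_test)
    shift_curvature = 0.5 * shift @ H_test @ shift
    print(f"curvature scale {scale:4.1f}: excess test loss {excess:.4f} "
          f"= 0.5 * shift^T H shift = {shift_curvature:.4f}")
```

In this toy setting the excess test loss coincides exactly with the shift-curvature term; in the full analysis it appears alongside the bias-curvature and parameter-covariance contributions described above.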