Keywords: Sharpness, Flatness, Generalization, Generalization Bound, SAM
TL;DR: For two-layer ReLU networks, flatness does not always imply generalization, but sharpness minimization algorithms may still generalize even when non-generalizing flattest models exist.
Abstract: Despite extensive studies, the underlying reason why overparameterized
neural networks can generalize remains elusive. Existing theory shows that common stochastic optimizers prefer flatter minimizers of the training loss, and thus
a natural potential explanation is that flatness implies generalization. This work
critically examines this explanation. Through theoretical and empirical investigation, we identify the following three scenarios for two-layer ReLU networks: (1)
flatness provably implies generalization; (2) there exist non-generalizing flattest
models and sharpness minimization algorithms fail to generalize; and (3)
perhaps most strikingly, there exist non-generalizing flattest models, but sharpness
minimization algorithms still generalize. Our results suggest that the relationship
between sharpness and generalization subtly depends on the data distributions
and the model architectures, and that sharpness minimization algorithms do not rely solely on
minimizing sharpness to achieve better generalization. This calls for the search for
other explanations for the generalization of overparameterized neural networks.
Supplementary Material: zip
Submission Number: 8237