Entropic gradient descent algorithms and wide flat minimaDownload PDF

Published: 12 Jan 2021, Last Modified: 22 Oct 2023ICLR 2021 PosterReaders: Everyone
Keywords: flat minima, entropic algorithms, statistical physics, belief-propagation
Abstract: The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. In this work we first discuss the relationship between alternative measures of flatness: The local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically in extensive tests on state-of-the-art networks to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. Then, we extend the analysis to the deep learning scenario by extensive numerical validations. We study two algorithms, Entropy-SGD and Replicated-SGD, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet).
One-sentence Summary: Relation between local entropy, flat minima, entropic algorithms and good generalization.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Code: [![gitlab](/images/gitlab_icon.svg) bocconi-artlab/sacreddnn](https://gitlab.com/bocconi-artlab/sacreddnn)
Data: [MNIST](https://paperswithcode.com/dataset/mnist), [Tiny ImageNet](https://paperswithcode.com/dataset/tiny-imagenet)
Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/arxiv:2006.07897/code)
10 Replies