Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission · Readers: Everyone
Keywords: early phase of training, implicit regularization, SGD, learning rate, batch size, Hessian, Fisher Information Matrix, curvature, gradient norm
Abstract: The early phase of training has been shown to be important in two ways for deep neural networks. First, the degree of regularization in this phase significantly impacts the final generalization. Second, it is accompanied by a rapid change in the local loss curvature influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue it is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with the final generalization. We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training, a phenomenon we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.
One-sentence Summary: Explicit regularization of the trace of the Fisher Information Matrix models implicit regularization in stochastic gradient descent.
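The explicit regularizer described in the abstract adds a penalty proportional to the trace of the FIM to the training loss. Below is a minimal sketch of one way to do this, assuming a PyTorch classification setup; the function fisher_penalty and the coefficient lam are illustrative names, not the authors' released implementation. The trace is estimated as the squared norm of the mini-batch gradient of the log-likelihood, with labels sampled from the model's own predictive distribution.

```python
# Illustrative sketch (assumed PyTorch setup), not the authors' code.
import torch
import torch.nn.functional as F

def fisher_penalty(model, inputs):
    """Monte Carlo estimate of the trace of the FIM: the squared norm of the
    mini-batch gradient of log p(y|x), with labels y drawn from the model's
    own predictive distribution rather than the training labels."""
    logits = model(inputs)
    with torch.no_grad():
        sampled_labels = torch.distributions.Categorical(logits=logits).sample()
    nll = F.cross_entropy(logits, sampled_labels)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True keeps the graph so the penalty itself can be
    # backpropagated through (second-order terms).
    grads = torch.autograd.grad(nll, params, create_graph=True)
    return sum(g.pow(2).sum() for g in grads)

# Hypothetical training step: lam weights the penalty against the task loss.
# loss = F.cross_entropy(model(x), y) + lam * fisher_penalty(model, x)
# loss.backward(); optimizer.step()
```

Because the penalty is a function of gradients, differentiating it requires retaining the computation graph (create_graph=True), which roughly doubles the cost of a training step in this sketch.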
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=7RoGvd1Lbn