Late-Phase Second-Order Training

Lukas Tatzel; Philipp Hennig; Frank Schneider

Published: 20 Oct 2022, Last Modified: 05 May 2023, Venue: HITY Workshop NeurIPS 2022, Readers: Everyone

Keywords: deep learning, stochastic optimization, Hessian-free optimizer, second-order method, late-phase training, fine-convergence, learning rate decay, PyTorch implementation

TL;DR: We study the Hessian-free optimizer as an alternative to learning rate decays for late-phase training.

Abstract: Towards the end of training, stochastic first-order methods such as SGD and ADAM go into diffusion and no longer make significant progress. In contrast, Newton-type methods are highly efficient "close" to the optimum, in the deterministic case. Therefore, these methods might turn out to be a particularly efficient tool for the final phase of training in the stochastic deep learning context as well. In our work, we study this idea by conducting an empirical comparison of a second-order Hessian-free optimizer and different first-order strategies with learning rate decays for late-phase training. We show that performing a few costly but precise second-order steps can outperform first-order alternatives in wall-clock runtime.
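
To make the term "Hessian-free" concrete, here is a minimal sketch of one such step on a deterministic toy quadratic. It is not the authors' PyTorch implementation: all function names are hypothetical, and the Hessian-vector product is approximated by finite differences of the gradient, which is an assumption made here to keep the example free of dependencies (in practice, automatic differentiation would typically supply it). The point it illustrates is that the Newton direction is obtained by running conjugate gradients (CG) on the linear system (H + damping * I) d = -g, which needs only products of the Hessian H with vectors and never the matrix itself.

```python
from typing import Callable, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def axpy(alpha: float, u: Vector, v: Vector) -> Vector:
    """Return alpha * u + v."""
    return [alpha * a + b for a, b in zip(u, v)]


def hessian_vector_product(grad_fn: Callable[[Vector], Vector],
                           x: Vector, v: Vector, eps: float = 1e-6) -> Vector:
    # Assumption: finite-difference approximation H v ~ (g(x + eps v) - g(x)) / eps.
    g0 = grad_fn(x)
    g1 = grad_fn(axpy(eps, v, x))
    return [(a - b) / eps for a, b in zip(g1, g0)]


def hessian_free_step(grad_fn: Callable[[Vector], Vector], x: Vector,
                      damping: float = 0.0, cg_iters: int = 10,
                      tol: float = 1e-10) -> Vector:
    """One Newton-type step: solve (H + damping * I) d = -g with CG, return x + d."""
    g = grad_fn(x)
    d = [0.0] * len(x)
    r = [-gi for gi in g]  # residual of the linear system at d = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(cg_iters):
        if rs < tol:
            break
        Hp = axpy(damping, p, hessian_vector_product(grad_fn, x, p))
        curvature = dot(p, Hp)
        if curvature <= 0.0:  # stop on non-positive curvature
            break
        alpha = rs / curvature
        d = axpy(alpha, p, d)
        r = axpy(-alpha, Hp, r)
        rs_new = dot(r, r)
        p = axpy(rs_new / rs, p, r)
        rs = rs_new
    return axpy(1.0, d, x)


def toy_grad(x: Vector) -> Vector:
    # Gradient of the toy objective 0.5 * x^T A x - b^T x
    # with A = [[3, 1], [1, 2]] and b = [1, 1]; its minimizer is (0.2, 0.4).
    return [3.0 * x[0] + 1.0 * x[1] - 1.0,
            1.0 * x[0] + 2.0 * x[1] - 1.0]


x_new = hessian_free_step(toy_grad, [5.0, -3.0])
```

On this noise-free quadratic, a single undamped step lands at the minimizer up to numerical error, which is the deterministic behavior the abstract alludes to. The stochastic mini-batch setting studied in the paper is not captured by this toy example.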
