Fisher-Legendre (FishLeg) optimization of deep neural networks

Jezabel R Garcia; Federica Freddi; Stathi Fotiadis; Maolin Li; Sattar Vakili; Alberto Bernacchia; Guillaume Hennequin

Fisher-Legendre (FishLeg) optimization of deep neural networks

Jezabel R Garcia, Federica Freddi, Stathi Fotiadis, Maolin Li, Sattar Vakili, Alberto Bernacchia, Guillaume Hennequin

Published: 01 Feb 2023, Last Modified: 28 Feb 2023ICLR 2023 notable top 25%Readers: Everyone

Keywords: Second-order optimization, Natural Gradient, Deep Learning, Meta-learning, Fisher information, Legendre-Fenchel duality

TL;DR: We introduce a new approach to estimate the natural gradient via Legendre-Fenchel duality, provide a convergence proof, and show competitive performance on a number of benchmarks.

Abstract: Incorporating second-order gradient information (curvature) into optimization can dramatically reduce the number of iterations required to train machine learning models. In natural gradient descent, such information comes from the Fisher information matrix which yields a number of desirable properties. As exact natural gradient updates are intractable for large models, successful methods such as KFAC and sequels approximate the Fisher in a structured form that can easily be inverted. However, this requires model/layer-specific tensor algebra and certain approximations that are often difficult to justify. Here, we use ideas from Legendre-Fenchel duality to learn a direct and efficiently evaluated model for the product of the inverse Fisher with any vector, in an online manner, leading to natural gradient steps that get progressively more accurate over time despite noisy gradients. We prove that the resulting “Fisher-Legendre” (FishLeg) optimizer converges to a (global) minimum of non-convex functions satisfying the PL condition, which applies in particular to deep linear networks. On standard auto-encoder benchmarks, we show empirically that FishLeg outperforms standard first-order optimization methods, and performs on par with or better than other second-order methods, especially when using small batches. Thanks to its generality, we expect our approach to facilitate the handling of a variety neural network layers in future work.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics

Submission Guidelines: Yes

Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning

18 Replies

Loading