Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond

Pan Zhou; Hanshu YAN; Xiaotong Yuan; Jiashi Feng; Shuicheng YAN

Published: 09 Nov 2021, Last Modified: 26 Jan 2025 | NeurIPS 2021 Poster | Readers: Everyone

Keywords: lookahead, excess risk error, deep learning optimization, deep learning generalization

TL;DR: We theoretically analyze why lookahead can achieve better test performance than its inner-loop optimizer by showing that it attains a smaller excess risk error.

Abstract: To train networks, the lookahead algorithm~\cite{zhang2019lookahead} updates its fast weights $k$ times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g. SGD, can serve as the inner-loop optimizer, and the resulting lookahead generally enjoys a remarkable test performance improvement over the vanilla optimizer. However, a theoretical understanding of this test performance improvement is still lacking. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead using SGD as its inner-loop optimizer can better balance the optimization error and the generalization error to achieve a smaller excess risk error than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-{\L}ojasiewicz condition, which has been observed/proved in neural networks. Moreover, we show that the stagewise optimization strategy~\cite{barshan2015stage}, which decays the learning rate several times during training, can also benefit lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm, which at each stage minimizes the sum of the vanilla objective and a local regularizer, and which provably enjoys optimization and generalization improvements over the conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet confirm its advantages. Code is available at \url{https://github.com/sail-sg/SLRLA-optimizer}.
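The fast/slow weight update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or the SLRLA algorithm: it assumes the interpolation rule of the cited lookahead paper (slow weights move a fraction `alpha` toward the latest fast weights), uses a deterministic gradient on a toy quadratic in place of stochastic gradients, and all names (`lookahead_sgd`, `quadratic_grad`, `alpha`, `lr`, `outer_steps`) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def lookahead_sgd(
    grad: Callable[[Vector], Vector],
    w0: Vector,
    k: int = 5,             # number of inner-loop (fast weight) updates
    alpha: float = 0.5,     # assumed interpolation step for the slow weights
    lr: float = 0.1,        # inner-loop learning rate
    outer_steps: int = 50,  # number of slow weight updates
) -> Vector:
    """Toy lookahead loop with plain gradient steps as the inner optimizer."""
    slow = list(w0)
    for _ in range(outer_steps):
        # Fast weights start each outer step from the current slow weights.
        fast = list(slow)
        for _ in range(k):
            g = grad(fast)
            fast = [w - lr * gi for w, gi in zip(fast, g)]
        # Slow weights move toward the latest fast weights.
        slow = [s + alpha * (f - s) for s, f in zip(slow, fast)]
    return slow


def quadratic_grad(w: Vector) -> Vector:
    """Gradient of the assumed toy objective 0.5 * ||w - target||^2."""
    target = [1.0, -2.0]
    return [wi - ti for wi, ti in zip(w, target)]


w_final = lookahead_sgd(quadratic_grad, [0.0, 0.0])
print(w_final)
```

On this strongly convex toy problem the slow weights approach the minimizer `[1.0, -2.0]`. Replacing `quadratic_grad` with a minibatch gradient would give the SGD inner loop that the abstract analyzes.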

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf

Code: https://github.com/sail-sg/SLRLA-optimizer
