A Coefficient Makes SVRG Effective

16 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: optimization
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Optimization; Variance Reduction; SGD
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: Introducing a coefficient to control the variance reduction strength in SVRG makes it effective for deep networks.
Abstract: Stochastic Variance Reduced Gradient (SVRG), introduced by Johnson & Zhang (2013), is a theoretically compelling optimization approach. However, as underscored by Defazio & Bottou (2019), its practical effectiveness in deep learning has yet to be demonstrated. In this work, we unveil the potential of SVRG in optimizing real-world neural networks. Our analysis reveals that the variance reduction strength in SVRG should be lower for deep networks and decrease as training progresses. This insight inspires us to introduce a multiplicative coefficient $\alpha$ to control its strength and adjust it with a linear decay schedule. We name our method $\alpha$-SVRG. Our results demonstrate that $\alpha$-SVRG better optimizes neural networks, consistently lowering the training loss compared to both the baseline and standard SVRG across various architectures and datasets. Our work is the first to bring the benefit of SVRG to training neural networks at a practical scale. We hope it encourages further exploration into gradient variance reduction techniques in deep learning.
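The abstract describes scaling SVRG's control variate by a coefficient $\alpha$ that decays linearly over training. Below is a minimal sketch of that idea on a toy least-squares problem; the toy objective, hyperparameter names (`alpha0`, `lr`, `n_epochs`), and the per-epoch snapshot schedule are illustrative assumptions rather than the paper's exact setup, which targets deep networks.

```python
# Hedged sketch: alpha-SVRG gradient estimate on a toy least-squares problem.
# alpha = 1 recovers standard SVRG; alpha = 0 recovers plain SGD.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_i(w, i):
    """Gradient of the i-th squared-error term, 0.5 * (x_i @ w - y_i)^2."""
    return (X[i] @ w - y[i]) * X[i]

def full_grad(w):
    """Full-batch gradient, recomputed at each snapshot."""
    return X.T @ (X @ w - y) / n

w = np.zeros(d)
lr, alpha0, n_epochs = 0.01, 0.75, 20  # assumed values for illustration

for epoch in range(n_epochs):
    # Linear decay of the variance-reduction strength over training.
    alpha = alpha0 * (1.0 - epoch / n_epochs)

    # Snapshot model and its full gradient, as in standard SVRG.
    w_snap = w.copy()
    g_snap = full_grad(w_snap)

    for i in rng.permutation(n):
        # alpha-SVRG estimate: the SVRG control variate scaled by alpha.
        g = grad_i(w, i) - alpha * (grad_i(w_snap, i) - g_snap)
        w -= lr * g

print("final loss:", 0.5 * np.mean((X @ w - y) ** 2))
```

In this sketch, decaying `alpha` shifts the update smoothly from variance-reduced SVRG toward plain SGD as training progresses, mirroring the abstract's observation that the variance reduction strength should decrease over time.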
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 763