Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

Published: 25 Jan 2025 · Last Modified: 25 Jan 2025 · Accepted by TMLR · License: CC BY 4.0
Abstract: Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization. Incorporating second-order information has further improved the performance of these first-order methods. Yet, comparatively little is known about the benefits of using variance reduction to accelerate popular stochastic second-order methods such as Subsampled Newton. To address this, we propose Stochastic Variance-Reduced Newton (SVRN), a finite-sum minimization algorithm that provably accelerates existing stochastic Newton methods from $O(\alpha\log(1/\epsilon))$ to $O\big(\frac{\log(1/\epsilon)}{\log(n)}\big)$ passes over the data, i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of sum components and $\alpha$ is the approximation factor in the Hessian estimate. Surprisingly, this acceleration becomes more significant as the data size $n$ grows, which is a unique property of SVRN. Our algorithm retains the key advantages of Newton-type methods, such as easily parallelizable large-batch operations and a simple unit step size. We use SVRN to accelerate Subsampled Newton and Iterative Hessian Sketch algorithms, and show that it compares favorably to popular first-order methods with variance reduction.
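To make the abstract's description concrete, the following minimal NumPy sketch shows one plausible shape of such an update: an SVRG-style variance-reduced gradient estimate over a large mini-batch, preconditioned by a subsampled Hessian computed at the outer snapshot, taken with a unit step size. The function names, sampling scheme, and loop structure are illustrative assumptions, not the paper's exact algorithm; see the linked repository for the authors' implementation.

```python
import numpy as np

def svrn_sketch(grad_i, hess_i, x0, n, batch_size, n_outer, n_inner,
                hess_sample_size, seed=None):
    """Illustrative variance-reduced Newton loop (a sketch, not the paper's code).

    grad_i(i, x) -> gradient of the i-th sum component at x (1-D array)
    hess_i(i, x) -> Hessian of the i-th sum component at x (2-D array)
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(n_outer):
        x_snap = x.copy()
        # Full gradient at the snapshot: one pass over all n components.
        full_grad = sum(grad_i(i, x_snap) for i in range(n)) / n
        # Approximate Hessian from a subsample, inverted once per outer iteration.
        hs = rng.choice(n, size=hess_sample_size, replace=False)
        H = sum(hess_i(i, x_snap) for i in hs) / hess_sample_size
        H_inv = np.linalg.inv(H)
        for _ in range(n_inner):
            # Large-batch variance-reduced gradient estimate (SVRG-style control variate).
            b = rng.choice(n, size=batch_size, replace=False)
            g = full_grad + sum(grad_i(i, x) - grad_i(i, x_snap) for i in b) / batch_size
            # Newton-type step with unit step size, as highlighted in the abstract.
            x = x - H_inv @ g
    return x
```

Note that, per the abstract, the Hessian estimate only needs to be an $\alpha$-approximation of the true Hessian, which is why a coarse subsampled (or sketched) Hessian suffices; the plain subsampling above is chosen purely for concreteness.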
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=JHowrXNzSw
Changes Since Last Submission: Changes for the camera ready:
- Added further motivation for the large mini-batches and the two complexity measures (top of page 3).
- Simplified Table 1 to avoid confusion.
- Added further motivation/clarification for the Hessian approximation condition (2).
- Extended Lemma 8 in Appendix D.2 to allow for Hessian approximations based on fewer than $\kappa$ Hessian component samples.
- Added Remark 3, discussing the dependence on the condition number in the statement of Theorem 3, and when it can be avoided.
- Added a clarification to Figure 2 and a clarification of the experimental setup.
- Expanded the discussion of the communication cost of SVRN in the context of gradient mini-batch resampling (Section 5.2).
Code: https://github.com/svrnewton/svrn
Assigned Action Editor: ~Murat_A_Erdogdu1
Submission Number: 2581