Keywords: Optimization theory, Variance reduction, Subsampled Newton, Finite-sum minimization, Stochastic optimization
Abstract: Stochastic variance reduction has proven effective at accelerating
first-order algorithms for solving convex finite-sum optimization
tasks such as empirical risk minimization. Yet, the benefits of
variance reduction for first-order methods tend to vanish in the large-batch
setting, i.e., when stochastic gradients are computed from very
large mini-batches to exploit the parallelism of modern computing architectures.
On the other hand, incorporating
second-order information via Newton-type methods has proven successful in improving the
performance of large-batch algorithms. In this work, we show that,
in the presence of second-order information, variance reduction in
the gradient can provide significant convergence acceleration even
when using extremely large-batch gradient estimates.
To demonstrate this, we study a finite-sum minimization algorithm we call Stochastic
Variance-Reduced Newton (SVRN). We show that SVRN
provably accelerates existing stochastic Newton-type methods (such as
Subsampled Newton), while
retaining their parallelizable large-batch operations: The number of
passes over the data is reduced from
$O(\alpha\log(1/\epsilon))$ to
$O\big(\frac{\log(1/\epsilon)}{\log(n)}\big)$,
i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of
sum components and $\alpha$ is the approximation factor in the
Hessian estimate. Surprisingly, this acceleration becomes more significant as the
data size $n$ grows, and it can be achieved with a unit step size.
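Below is a minimal, illustrative sketch of an SVRN-style loop for a least-squares finite sum, combining an SVRG-style variance-reduced large-batch gradient with a subsampled-Hessian Newton step at unit step size, as described in the abstract. The function name `svrn_sketch`, the batch sizes, and the small ridge term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def svrn_sketch(A, b, num_stages=10, inner_iters=20,
                hessian_batch=2000, grad_batch=5000, seed=0):
    """Illustrative SVRN-style loop for least squares f(w) = (1/2n)||Aw - b||^2.

    Each stage: (1) exact gradient at the anchor point, (2) subsampled Hessian
    estimate, (3) inner iterations using a variance-reduced large-batch
    gradient and a unit-step Newton update. Batch sizes are placeholders.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    w_anchor = np.zeros(d)
    for _ in range(num_stages):
        full_grad = A.T @ (A @ w_anchor - b) / n            # exact gradient at anchor
        h_idx = rng.choice(n, size=min(hessian_batch, n), replace=False)
        H = A[h_idx].T @ A[h_idx] / len(h_idx)              # subsampled Hessian estimate
        H += 1e-8 * np.eye(d)                               # small ridge for numerical stability (assumption)
        w = w_anchor.copy()
        for _ in range(inner_iters):
            idx = rng.choice(n, size=min(grad_batch, n), replace=False)
            As, bs = A[idx], b[idx]
            # variance-reduced gradient: large-batch gradient at w, corrected by the
            # same batch's gradient at the anchor plus the exact anchor gradient
            g = (As.T @ (As @ w - bs) - As.T @ (As @ w_anchor - bs)) / len(idx) + full_grad
            w = w - np.linalg.solve(H, g)                   # Newton-type step with unit step size
        w_anchor = w
    return w_anchor
```

This sketch is meant only to show how the variance-reduced gradient estimate and the subsampled Hessian combine in a unit-step Newton update; it does not reproduce the paper's algorithm statement or experiments.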
Submission Number: 59