Keywords: SGD, PL condition, Random Reshuffling, Interpolation
TL;DR: Under interpolation, Random Reshuffling converges faster than SGD when the number of samples is less than the condition number of the problem, without requiring a large number of epochs.
Abstract: Modern machine learning models are often over-parameterized and, as a result, can interpolate the training data. In this setting, we study the convergence properties of a sampling-without-replacement variant of Stochastic Gradient Descent (SGD) known as Random Reshuffling (RR). Unlike SGD, which samples data with replacement at every iteration, RR chooses a random permutation of the data at the beginning of each epoch. For under-parameterized models, it has recently been shown that RR converges faster than SGD when the number of epochs is larger than the condition number (κ) of the problem, under standard assumptions such as strong convexity. However, previous works do not show that RR outperforms SGD under interpolation for strongly convex objectives. Here, we show that for the class of Polyak-Łojasiewicz (PL) functions, which generalizes strong convexity, RR can outperform SGD as long as the number of samples (n) is less than the parameter (ρ) of a strong growth condition (SGC).
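To make the sampling distinction in the abstract concrete, here is a minimal sketch contrasting the two update schemes on a generic finite-sum objective. The names (`grad_i`, `stepsize`, the toy least-squares problem) are illustrative placeholders and not the paper's notation, algorithms, or experiments.

```python
import numpy as np

def sgd_epoch(x, grad_i, n, stepsize, rng):
    """One pass of n SGD steps: sample an index with replacement at every step."""
    for _ in range(n):
        i = rng.integers(n)                # with-replacement draw; repeats possible
        x = x - stepsize * grad_i(x, i)
    return x

def rr_epoch(x, grad_i, n, stepsize, rng):
    """One epoch of Random Reshuffling: draw a fresh permutation, then sweep it."""
    perm = rng.permutation(n)              # without replacement; each sample used exactly once
    for i in perm:
        x = x - stepsize * grad_i(x, i)
    return x

# Toy usage on a least-squares problem f(x) = (1/2n) * sum_i (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
n, d = 50, 5
A, b = rng.normal(size=(n, d)), rng.normal(size=n)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]   # gradient of the i-th component

x_sgd = x_rr = np.zeros(d)
for _ in range(20):                        # 20 epochs of each scheme
    x_sgd = sgd_epoch(x_sgd, grad_i, n, 0.01, rng)
    x_rr = rr_epoch(x_rr, grad_i, n, 0.01, rng)
```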
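For reference, the PL condition and the strong growth condition named in the abstract are standard in the literature; in commonly used forms (notation and normalization here may differ from the paper's), for a finite-sum objective f = (1/n) Σᵢ fᵢ they read:

```latex
% Polyak-Lojasiewicz (PL) condition with parameter \mu > 0:
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^{*}\bigr)
\qquad \text{for all } x,

% Strong growth condition (SGC) with parameter \rho:
\frac{1}{n}\sum_{i=1}^{n}\|\nabla f_i(x)\|^2 \;\le\; \rho\,\|\nabla f(x)\|^2
\qquad \text{for all } x.
```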