Keywords: SGD, PL condition, Random Reshuffling, Interpolation
TL;DR: Random Reshuffling converges faster than SGD when the number of samples is less than the condition number of the problem without requiring large number of epochs under interpolation.
Abstract: Modern machine learning models are often over-parameterized and as a result they can interpolate the training data. Under such a scenario, we study the convergence properties of a sampling- without-replacement variant of Stochastic Gradient Descent (SGD), known as Random Reshuffling (RR). Unlike SGD that samples data with replacement at every iteration, RR chooses a random permutation of data at the beginning of each epoch. For under-parameterized models, it has been recently shown that RR converges faster than SGD when the number of epochs is larger than the condition number (κ) of the problem under standard assumptions like strong convexity. However, previous works do not show that RR outperforms SGD under interpolation for strongly convex objectives. Here, we show that for the class of Polyak-Łojasiewicz (PL) functions that generalizes strong convexity, RR can outperform SGD as long as the number of samples (n) is less than the parameter (ρ) of a strong growth condition (SGC).