TL;DR: We prove that coordinate descent with indices drawn from random permutations (RPCD) converges provably faster than coordinate descent with uniformly sampled indices (RCD) on a subclass of positive-definite quadratic functions.
Abstract: We analyze the convergence rates of two popular variants of coordinate descent (CD): random CD (RCD), in which the coordinates are sampled uniformly at random, and random-permutation CD (RPCD), in which random permutations are used to select the update indices. Despite abundant empirical evidence that RPCD outperforms RCD in various tasks, the theoretical gap between the two algorithms’ performance has remained elusive. Even for the benign case of positive-definite quadratic functions with permutation-invariant Hessians, previous efforts have failed to demonstrate a provable performance gap between RCD and RPCD. In this work, we present novel results showing that, for a class of quadratics with permutation-invariant structures, the contraction rate upper bound for RPCD is always strictly smaller than the contraction rate lower bound for RCD on every individual problem instance. Furthermore, we conjecture that this function class contains the worst-case examples of RPCD among all positive-definite quadratics. Combined with our RCD lower bound, this conjecture would extend our results to the general class of positive-definite quadratic functions.
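To make the two sampling schemes concrete, below is a minimal Python sketch that runs RCD and RPCD with exact coordinate minimization on a positive-definite quadratic f(x) = 0.5 x^T A x - b^T x. The specific Hessian A = delta*I + (1-delta)*ones, as well as the dimension, delta, and epoch count, are illustrative assumptions and not necessarily the setting analyzed in the paper.

# A minimal sketch (not the paper's experimental setup) contrasting RCD and RPCD
# on a positive-definite quadratic f(x) = 0.5 * x^T A x - b^T x.
# The Hessian below is a hypothetical permutation-invariant example,
# A = delta * I + (1 - delta) * ones, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(0)

n = 50
delta = 0.1
A = delta * np.eye(n) + (1 - delta) * np.ones((n, n))  # permutation-invariant, positive definite
b = rng.standard_normal(n)


def cd_step(x, i):
    """Exact coordinate minimization along coordinate i for the quadratic."""
    # Setting the i-th partial derivative (A x - b)_i to zero gives this update.
    x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x


def rcd(x0, epochs):
    """Random CD: each update index is sampled uniformly and independently."""
    x = x0.copy()
    for _ in range(epochs * n):
        x = cd_step(x, rng.integers(n))
    return x


def rpcd(x0, epochs):
    """Random-permutation CD: each epoch visits all coordinates in a fresh random order."""
    x = x0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n):
            x = cd_step(x, i)
    return x


x_star = np.linalg.solve(A, b)
x0 = rng.standard_normal(n)
for name, run in [("RCD", rcd), ("RPCD", rpcd)]:
    err = np.linalg.norm(run(x0, epochs=20) - x_star)
    print(f"{name}: distance to optimum after 20 epochs = {err:.3e}")

The only difference between the two routines is how the update index is chosen: independently and uniformly at every step for RCD, versus a fresh random permutation of all coordinates in each epoch for RPCD.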
Lay Summary: Many machine learning methods solve optimization problems by updating one variable at a time—this is called coordinate descent. One popular version picks a random coordinate at each step independently. Another uses a random order (a permutation) of all variables, updating them one by one. While this second method is often faster in practice, we didn’t have a solid mathematical explanation for why this is the case.
Our work shows, for the first time, that this permutation-based method is provably faster on certain problem instances. We studied a particular class of well-behaved objective functions and proved that the permutation approach always converges faster than the independently random version. We also provide evidence suggesting that these cases are likely the worst-case examples, meaning that similar benefits may extend to a more general class of problems.
These results help us better understand how randomness can speed up optimization and may lead to faster algorithms in machine learning systems used in large-scale applications.
Primary Area: Optimization->Stochastic
Keywords: Optimization, Stochastic Optimization, Coordinate Descent, Random Permutations
Submission Number: 15629