Minibatch vs Local SGD with Shuffling: Tight Convergence Bounds and BeyondDownload PDF

29 Sept 2021, 00:35 (edited 15 Mar 2022)ICLR 2022 OralReaders: Everyone
  • Keywords: Local SGD, Minibatch SGD, Shuffling, Without-replacement, Convex Optimization, Stochastic Optimization, Federated Learning, Large Scale Learning, Distributed Learning
  • Abstract: In distributed learning, local SGD (also known as federated averaging) and its simple baseline minibatch SGD are widely studied optimization methods. Most existing analyses of these methods assume independent and unbiased gradient estimates obtained via with-replacement sampling. In contrast, we study shuffling-based variants: minibatch and local Random Reshuffling, which draw stochastic gradients without replacement and are thus closer to practice. For smooth functions satisfying the Polyak-Łojasiewicz condition, we obtain convergence bounds (in the large epoch regime) which show that these shuffling-based variants converge faster than their with-replacement counterparts. Moreover, we prove matching lower bounds showing that our convergence analysis is tight. Finally, we propose an algorithmic modification called synchronized shuffling that leads to convergence rates faster than our lower bounds in near-homogeneous settings.
  • One-sentence Summary: We provide tight upper and lower bounds on convergence rates of shuffling-based minibatch SGD and local SGD, and propose an algorithmic modification that improves convergence rates beyond our lower bounds.
  • Supplementary Material: zip
9 Replies