Parallel Stochastic Gradient Descent with Sound Combiners

Saeed Maleki, Madanlal Musuvathi, Todd Mytkowicz, Yufei Ding

Nov 04, 2016 (modified: Dec 04, 2016) ICLR 2017 conference submission readers: everyone
  • Abstract: Stochastic gradient descent (SGD) is a well-known method for regression and classification tasks. However, it is an inherently sequential algorithm: at each step, the processing of the current example depends on the parameters learned from the previous examples. Prior approaches to parallelizing SGD, such as Hogwild! and AllReduce, do not honor these dependences across threads and thus can suffer from poor convergence rates, poor scalability, or both. This paper proposes SymSGD, a parallel SGD algorithm that retains the sequential semantics of SGD in expectation. Each thread in this approach learns a local model and a probabilistic model combiner; the combiner allows the local models to be combined to produce, in expectation, the same result that sequential SGD would have produced. The SymSGD approach is applicable to any linear learner whose update rule is linear. This paper evaluates SymSGD’s accuracy and performance on 9 datasets on a shared-memory machine and shows up to 13× speedup over our heavily optimized sequential baseline on 16 cores.
  • TL;DR: This paper proposes SymSGD, a parallel SGD algorithm that retains the sequential semantics of SGD in expectation.
  • Conflicts: microsoft.com, illinois.edu, ncsu.edu
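
To make the "local model plus model combiner" idea in the abstract concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation. It assumes a least-squares loss, whose SGD update is affine in the weights, and it stores the combiner as a full d x d matrix, so the combined result matches sequential SGD exactly rather than in expectation. The abstract describes a probabilistic combiner, which this sketch does not attempt to reproduce. All function names (`sgd_step`, `local_model_and_combiner`, `combine`) are hypothetical.

```python
import random

def sgd_step(w, x, y, lr):
    """One least-squares SGD step: w <- w - lr * (x.w - y) * x."""
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [wi - lr * err * xi for wi, xi in zip(w, x)]

def sequential_sgd(w, data, lr):
    """Reference: process every example in order."""
    for x, y in data:
        w = sgd_step(w, x, y, lr)
    return w

def local_model_and_combiner(w_start, data, lr):
    """Run SGD on one chunk from w_start and track the matrix
    M = product of (I - lr * x x^T) over the chunk (assumed dense)."""
    d = len(w_start)
    w = list(w_start)
    M = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for x, y in data:
        w = sgd_step(w, x, y, lr)
        # M <- (I - lr x x^T) M, computed as M - lr * x * (x^T M)
        xtM = [sum(x[k] * M[k][j] for k in range(d)) for j in range(d)]
        M = [[M[i][j] - lr * x[i] * xtM[j] for j in range(d)]
             for i in range(d)]
    return w, M

def combine(w_prev, w_start, w_local, M):
    """Adjust w_local as if its chunk had started from w_prev
    instead of w_start: w_local + M (w_prev - w_start)."""
    delta = [a - b for a, b in zip(w_prev, w_start)]
    d = len(delta)
    return [w_local[i] + sum(M[i][j] * delta[j] for j in range(d))
            for i in range(d)]

# Synthetic data: 8 examples in 3 dimensions, split into two chunks.
random.seed(0)
d, lr = 3, 0.05
data = [([random.uniform(-1, 1) for _ in range(d)], random.uniform(-1, 1))
        for _ in range(8)]
chunk1, chunk2 = data[:4], data[4:]
w0 = [0.0] * d

# Both "threads" start from w0 and could run independently.
w1, _ = local_model_and_combiner(w0, chunk1, lr)
w2_local, M2 = local_model_and_combiner(w0, chunk2, lr)

# Combine in chunk order, then compare with the sequential run.
w_par = combine(w1, w0, w2_local, M2)
w_seq = sequential_sgd(w0, data, lr)
max_diff = max(abs(a - b) for a, b in zip(w_seq, w_par))
```

The combination is exact here because each step maps w to (I - lr x xᵀ) w + lr y x, so shifting a chunk's starting point by some delta shifts its result by M times delta. Storing a dense M costs O(d²) memory per chunk, which is the cost a cheaper, approximate combiner would aim to avoid.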
