Parallel Stochastic Gradient Descent with Sound Combiners

Submitted to ICLR 2017
Abstract: Stochastic gradient descent (SGD) is a well-known method for regression and classification tasks. However, it is an inherently sequential algorithm: at each step, the processing of the current example depends on the parameters learned from the previous examples. Prior approaches to parallelizing SGD, such as Hogwild! and AllReduce, do not honor these dependencies across threads and thus can suffer poor convergence rates and/or poor scalability. This paper proposes SymSGD, a parallel SGD algorithm that retains the sequential semantics of SGD in expectation. Each thread learns a local model and a probabilistic model combiner that allows the local models to be combined to produce, in expectation, the same result as a sequential SGD. SymSGD is applicable to any linear learner whose update rule is linear in the model parameters. An evaluation of SymSGD's accuracy and performance on 9 datasets on a shared-memory machine shows up to a 13× speedup over our heavily optimized sequential baseline on 16 cores.
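The core idea behind a sound combiner can be made concrete for a learner whose SGD update is linear in the model, e.g. least-squares regression: one SGD step is w ← (I − αxxᵀ)w + αyx, so the net effect of a block of examples is an affine map w ↦ Mw + c that can be composed across threads. The sketch below is a minimal illustration under that assumption, not the paper's implementation; in particular it uses exact d×d combiners, whereas SymSGD uses a probabilistic (randomly projected) combiner to keep the cost low. The function name sgd_block and all variable names are hypothetical.

```python
import numpy as np

def sgd_block(w0, X, Y, lr):
    """Run least-squares SGD on a block of examples starting from w0.
    Also return an exact model combiner (M, c): an affine map such that
    running the same block from ANY starting model w yields M @ w + c.
    (SymSGD instead approximates M via random projection.)"""
    d = w0.shape[0]
    w = w0.copy()
    M = np.eye(d)           # accumulated linear part of the block's update
    c = np.zeros(d)         # accumulated constant part
    for x, y in zip(X, Y):
        # One step: w <- w - lr*(x.w - y)*x = (I - lr*x x^T) w + lr*y*x
        A = np.eye(d) - lr * np.outer(x, x)
        b = lr * y * x
        w = A @ w + b
        M = A @ M
        c = A @ c + b
    return w, (M, c)

# Two "threads" each process a block starting from the same w0; thread 2's
# combiner replays its block on top of thread 1's result, reproducing what
# sequential SGD over both blocks would have computed.
rng = np.random.default_rng(0)
d, n, lr = 5, 8, 0.1
X, Y = rng.normal(size=(2 * n, d)), rng.normal(size=2 * n)
w0 = np.zeros(d)

w1, _       = sgd_block(w0, X[:n], Y[:n], lr)   # thread 1's local model
_, (M2, c2) = sgd_block(w0, X[n:], Y[n:], lr)   # thread 2's local combiner
combined = M2 @ w1 + c2                         # sound combination

w_seq, _ = sgd_block(w0, X, Y, lr)              # sequential reference
assert np.allclose(combined, w_seq)
```

Because the combiner here is exact, the combination matches sequential SGD deterministically; SymSGD's probabilistic combiner gives the match only in expectation, which is the trade-off the abstract describes.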
TL;DR: This paper proposes SymSGD, a parallel SGD algorithm that retains the sequential semantics of SGD in expectation.
Conflicts: microsoft.com, illinois.edu, ncsu.edu