A Unified Coded Deep Neural Network Training Strategy based on Generalized PolyDot codes

ISIT 2018 (modified: 12 Nov 2022)
Abstract: This paper has two main contributions. First, we propose a novel coding technique - Generalized PolyDot - for matrix-vector products that improves upon existing techniques for coded matrix operations under storage and communication constraints. Second, we use Generalized PolyDot for the problem of training large Deep Neural Networks (DNNs) using unreliable nodes that are prone to soft-errors, e.g., bit flips during computation that produce erroneous outputs. An additional difficulty imposed by the problem of DNN training is that the parameter values (weight matrices) are updated at every iteration, and thus incur a prohibitively large encoding cost at every iteration if we naively extend existing coded computing techniques. Thus, we propose a "unified" coded DNN training strategy where we weave coding into the operations of DNN training itself, so that the weight matrices, once initially encoded, remain encoded during updates with negligible encoding/decoding overhead per iteration. Moreover, our strategy can also allow for errors even in the nonlinear step of training. Finally, our coded DNN training strategy is completely decentralized: no assumptions on the presence of a master node are made, which avoids any single point of failure under soft-errors. Our strategy can provide unboundedly better error tolerance than the competing replication strategy and an MDS-code-based strategy [1].
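To make the coded matrix-vector idea concrete, the following is a minimal sketch of polynomial-style coding for a product A @ v, in the spirit of PolyDot codes. It is a simplification, not the paper's Generalized PolyDot construction: the block count `m`, node count `n_nodes`, and evaluation points `xs` are illustrative choices, and the storage/communication trade-off that Generalized PolyDot optimizes is not modeled. The key property shown is that any `m` of the `n_nodes` worker outputs suffice to recover the full product, so up to `n_nodes - m` erroneous or missing outputs can be treated as erasures.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 3            # split A row-wise into m blocks (illustrative choice)
n_nodes = 5      # workers; up to n_nodes - m outputs may be lost/discarded
A = rng.standard_normal((6, 4))
v = rng.standard_normal(4)

blocks = np.split(A, m)  # A_0, ..., A_{m-1}, each of shape (2, 4)

# Encode: node i stores the polynomial evaluation
#   Ã_i = sum_j A_j * x_i**j
xs = np.arange(1.0, n_nodes + 1)
encoded = [sum(blocks[j] * x**j for j in range(m)) for x in xs]

# Each node computes only its small product Ã_i @ v = p(x_i),
# where p is a polynomial whose vector coefficients are A_j @ v.
results = [E @ v for E in encoded]

# Decode from ANY m of the n_nodes results (here nodes 0, 2, 4),
# by solving the Vandermonde interpolation system for the coefficients.
use = [0, 2, 4]
V = np.vander(xs[use], m, increasing=True)
coeffs = np.linalg.solve(V, np.stack([results[i] for i in use]))
recovered = coeffs.reshape(-1)  # concatenation of A_0 @ v, ..., A_{m-1} @ v

assert np.allclose(recovered, A @ v)
```

In the paper's setting, the analogous encodings of the weight matrices are kept encoded across gradient updates, so this encoding step is paid once rather than at every training iteration.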

