Abstract: In the distributed gradient coding problem, it has been established that, to exactly recover the gradient under $s$ slow machines, the minimum computation load (number of stored data partitions) of each worker is at least linear ($s+1$), which incurs a large overhead when $s$ is large~\cite{tandon2017gradient}. In this paper, we focus on approximate gradient coding, which aims to recover the gradient with bounded error $\epsilon$. Theoretically, our main contributions are three-fold: (i) we analyze the structure of optimal gradient codes and derive the information-theoretic lower bound on the minimum computation load: $O(\log(n)/\log(n/s))$ for $\epsilon = 0$ and $d \geq O(\log(1/\epsilon)/\log(n/s))$ for $\epsilon > 0$, where $d$ is the computation load and $\epsilon$ is the error in the gradient computation; (ii) we design two approximate gradient coding schemes, based on a random edge removal process, that exactly match these lower bounds; (iii) we implement our schemes and demonstrate their advantage over the current fastest gradient coding strategies. The proposed schemes provide an order-wise improvement over the state of the art in terms of computation load, and are also optimal in terms of both computation load and latency.
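To give a feel for the gap between the linear load $s+1$ of exact recovery and the logarithmic orders stated above, the following minimal Python sketch evaluates both expressions. It is illustrative only: the function names are hypothetical, the constants hidden by the $O(\cdot)$ notation are dropped (set to 1), $n$ is assumed to be the number of workers, and $\epsilon$ is assumed to lie in $[0, 1)$.

```python
import math


def exact_recovery_load(s: int) -> int:
    """Linear per-worker load s + 1 needed for exact gradient recovery."""
    if s < 0:
        raise ValueError("s must be non-negative")
    return s + 1


def approx_load_order(n: int, s: int, eps: float = 0.0) -> float:
    """Order of the lower bound from the abstract, with constants dropped.

    eps == 0 -> log(n) / log(n / s)
    eps  > 0 -> log(1 / eps) / log(n / s)
    Assumes 0 < s < n and 0 <= eps < 1 (assumptions of this sketch).
    """
    if not 0 < s < n:
        raise ValueError("require 0 < s < n")
    if not 0 <= eps < 1:
        raise ValueError("require 0 <= eps < 1")
    numerator = math.log(n) if eps == 0 else math.log(1 / eps)
    return numerator / math.log(n / s)


if __name__ == "__main__":
    n, s = 100, 10
    print("exact recovery load:", exact_recovery_load(s))
    print("order of bound, eps = 0:", approx_load_order(n, s))
    print("order of bound, eps = 0.01:", approx_load_order(n, s, eps=0.01))
```

With $n = 100$ and $s = 10$, for example, the exact-recovery load is 11, while both logarithmic expressions evaluate to roughly 2 under these simplifications.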