Abstract: In distributed machine learning (DML), the straggler problem caused by heterogeneous environments and external factors leads to high synchronization overhead and slows training progress. To alleviate the straggler problem, we propose a new dynamic optimal synchronous parallel (DOSP) strategy that performs partial synchronization based on dynamic clustering of iteration completion times. First, we present a model to calculate the completion time of DML parameter training. Then, we define the optimal synchronization point of a partial synchronization scheme and design a synchronization scheme based on clustering of iteration completion times. Finally, inspired by the delay phenomenon that arises when the slot between adjacent synchronization points is narrow, we define a gradient aggregation time slot to guide synchronization evaluation and obtain the optimal synchronization point. The whole idea has been implemented in a prototype called STAR (our implementation is available at https://github.com/oumiga1314/opt_experient). Experimental results on STAR show that DOSP improves training accuracy by 1–3% and training speed by 1.24–2.93× compared with other existing schemes.
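The core idea, grouping workers with similar iteration completion times so that each group synchronizes internally rather than all workers waiting for the slowest straggler, can be sketched as follows. This is a minimal illustrative sketch, not the paper's algorithm: the function name, the greedy gap-based clustering rule, and the `slot` parameter (a stand-in for the paper's gradient aggregation time slot) are all assumptions.

```python
def cluster_by_completion_time(times, slot):
    """Greedily cluster workers by iteration completion time.

    Workers are sorted by completion time; a new synchronization group
    starts whenever the gap to the previous worker exceeds `slot`
    (an illustrative stand-in for the gradient aggregation time slot).
    `times` must be non-empty.
    """
    order = sorted(range(len(times)), key=lambda w: times[w])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if times[cur] - times[prev] > slot:
            clusters.append([cur])      # gap too wide: start a new sync group
        else:
            clusters[-1].append(cur)    # close enough: join the current group
    return clusters

# Example: six workers with heterogeneous iteration times (seconds).
times = [1.0, 1.1, 1.05, 2.5, 2.6, 4.0]
print(cluster_by_completion_time(times, slot=0.5))
# prints [[0, 2, 1], [3, 4], [5]] -- fast workers 0, 2, 1 synchronize
# together, 3 and 4 form a second group, and straggler 5 runs alone
```

Within each group, gradients would then be aggregated among members only, so fast workers are not blocked by the slowest machine; how groups are merged back into a global model is part of the paper's full scheme and is not shown here.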