Petrel: Community-aware Synchronous Parallel for Heterogeneous Parameter ServerDownload PDFOpen Website

2020 (modified: 02 Nov 2022)ICDCS 2020Readers: Everyone
Abstract: As to address the impact of heterogeneity in distributed Deep Learning (DL) systems, most previous approaches focus on prioritizing the contribution of fast workers and reducing the involvement of slow workers, incurring the limitations of workload imbalance and computation inefficiency. We reveal that grouping workers into communities, an abstraction proposed by us, and handling parameter synchronization in community level can conquer these limitations and accelerate the training convergence progress. The inspiration of community comes from our exploration of prior knowledge about the similarity between workers, which is often neglected by previous work. These observations motivate us to propose a new synchronization mechanism named Community-aware Synchronous Parallel (CSP), which uses the Asynchronous Advantage Actor-Critic (A3C), a Reinforcement Learning (RL) based algorithm, to intelligently determine community configuration and fully improve the synchronization performance. The whole idea has been implemented in a system called Petrel that achieves a good balance between convergence efficiency and communication overhead. The evaluation under different benchmarks demonstrates our approach can effectively accelerate the training convergence speed and reduce synchro-nization traffic.
0 Replies

Loading