Fast Distributed Deep Learning via Worker-adaptive Batch Sizing

Chen Chen, Qizhen Weng, Wei Wang, Baochun Li, Bo Li

Published: 2018, Last Modified: 11 May 2023, Venue: SoCC 2018, Readers: Everyone

Abstract: In heterogeneous or shared clusters, distributed learning processes are slowed down by straggling workers. In this work, we propose LB-BSP, a new synchronization scheme that eliminates stragglers by adapting each worker's training load (batch size) to its processing capability. For training in shared production clusters, a prerequisite for deciding the workers' batch sizes is to know their processing speeds before each iteration starts. To this end, we adopt NARX, an extended recurrent neural network that accounts for both the historical speeds and the driving factors such as CPU and memory in prediction.
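The load-balancing idea in the abstract can be illustrated with a minimal sketch. It assumes that each worker's predicted speed is given in samples per second and that the global batch size is fixed, so giving each worker a share proportional to its speed makes all workers finish an iteration at roughly the same time. The function name `allocate_batch_sizes`, its parameters, and the largest-remainder rounding are illustrative assumptions, not the authors' published algorithm, and the speed prediction step (NARX in the paper) is not modeled here.

```python
from typing import List


def allocate_batch_sizes(speeds: List[float], global_batch: int) -> List[int]:
    """Split global_batch across workers in proportion to their speeds.

    speeds: predicted processing speed of each worker (samples/second).
            Hypothetical input; how it is predicted is out of scope here.
    global_batch: total number of samples processed per iteration.
    """
    if not speeds or any(v <= 0 for v in speeds):
        raise ValueError("speeds must be non-empty and strictly positive")
    if global_batch < 0:
        raise ValueError("global_batch must be non-negative")

    total_speed = sum(speeds)
    # Ideal (fractional) share: batch_i / speed_i is equal for every worker.
    ideal = [global_batch * v / total_speed for v in speeds]

    # Round down first, then hand the leftover samples to the workers
    # with the largest fractional parts (largest-remainder rounding).
    sizes = [int(x) for x in ideal]
    leftover = global_batch - sum(sizes)
    order = sorted(range(len(speeds)),
                   key=lambda i: ideal[i] - sizes[i],
                   reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # One worker is twice as fast as the other two.
    print(allocate_batch_sizes([100.0, 200.0, 100.0], 256))
```

In this sketch the faster worker receives twice as many samples as each slower one, so the expected per-iteration time (batch size divided by speed) is the same for all three workers.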
