Ray-based Elastic Distributed Data Parallel Framework with Distributed Data Cache

Published: 01 Jan 2023 · Last Modified: 13 May 2025 · IPDPS Workshops 2023 · CC BY-SA 4.0
Abstract: With the development of large-scale machine learning, distributed data parallelism has become the de facto standard strategy for model training. However, when training models with distributed data parallelism on large-scale clusters, unexpected factors may cause training tasks to fail. Thus, a high-performance, scalable, yet fault-tolerant distributed training framework is urgently needed. Most commonly used open-source distributed training frameworks (e.g., PyTorch) do not fully meet this need. In this paper, we have designed an elastic distributed training framework based on Ray, a high-performance distributed computing framework. Our framework takes advantage of Ray’s fault-tolerant store, scalability, and stateful actors. In our framework, training tasks are not terminated when the number of training processes changes. Moreover, we have designed an elastic distributed data cache using Ray’s object store and provided an efficient dataloader (called elastic_dataloader). Performance evaluation shows that elastic_dataloader is more than 2 times faster than PyTorch’s DataLoader on a cluster equipped with 10 Gigabit Ethernet.
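To make the caching idea concrete, the following is a minimal sketch (not the paper's actual API) of how dataset shards can be held in Ray's object store and streamed through a PyTorch-style dataloader; the names CachedShardDataset, cache_shards, and the toy data are illustrative assumptions only.

```python
# Minimal sketch of an object-store-backed data cache, assuming Ray and PyTorch.
# Shards are put into Ray's object store once; a dataloader then pulls them by
# reference, so workers that join or leave can read cached data without
# re-fetching it from remote storage.
import ray
import torch
from torch.utils.data import IterableDataset, DataLoader

ray.init(ignore_reinit_error=True)

def cache_shards(tensors):
    """Place each dataset shard into Ray's object store; returns ObjectRefs."""
    return [ray.put(t) for t in tensors]

class CachedShardDataset(IterableDataset):
    """Streams samples from shards held in the Ray object store (illustrative)."""
    def __init__(self, shard_refs):
        self.shard_refs = shard_refs

    def __iter__(self):
        for ref in self.shard_refs:
            shard = ray.get(ref)  # fetch the cached shard from the object store
            for sample in shard:
                yield sample

# Usage: cache two toy shards, then iterate as with a normal DataLoader.
shard_refs = cache_shards([torch.randn(8, 4), torch.randn(8, 4)])
loader = DataLoader(CachedShardDataset(shard_refs), batch_size=4)
for batch in loader:
    pass  # feed `batch` to the training step
```

In this sketch, elasticity comes from the fact that the ObjectRefs are independent of any particular training process: a newly joined worker only needs the list of references to resume reading, which mirrors the role the paper assigns to Ray's object store in its elastic data cache.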