Tereis: A Package-Based Scheduling in Deep Learning Systems

Published: 2022, Last Modified: 16 Nov 2024. Venue: ICPADS 2022. License: CC BY-SA 4.0
Abstract: Deep learning (DL) systems are typically used to accelerate the training of DL jobs. Training DL models requires feeding in massive amounts of input data, and transferring that data from the storage nodes to the compute nodes takes a long time. The GPUs' computational resources sit idle during this data transmission period, which wastes computing resources. In DL systems, a large number of short-term jobs queue for longer than their own execution times. Meanwhile, many multi-GPU jobs suffer long queuing times because too few GPUs are free. To the best of our knowledge, no prior study tries to use the idle GPU computation resources during the data transmission period. We propose Tereis, a package-based scheduler that makes full use of the GPU. Tereis predicts a DL job’s execution time and data transmission time, then safely packages two jobs on the same GPU. Here ‘safely’ means that one of the packaged jobs completes before the other finishes transferring its data, so the packaging does not cause GPU contention. Tereis also uses multi-level queues to prevent starvation. We implemented Tereis in Python on a real cluster and evaluated its performance. The experimental results show that Tereis decreases the average waiting time by 3.2× to 10.5× and improves makespan by 18% to 42% compared to other methods. Furthermore, we ran large-scale simulations to explore the sensitivity of Tereis. We find that Tereis performs better when the jobs' datasets are predominantly large and jobs are submitted frequently.
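
The ‘safe’ packaging condition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Job` type, the function names, and the simplification that both jobs start at the same moment are assumptions made here for illustration, and the predicted times are taken as given inputs rather than produced by a predictor.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; both times are predictions, in seconds."""
    name: str
    transfer_s: float  # predicted data transmission time
    exec_s: float      # predicted execution time on the GPU


def fits_in_transfer_window(guest: Job, host: Job) -> bool:
    """True if `guest` finishes entirely (transfer + execution) before
    `host` finishes transferring its data.

    Assumption: both jobs are placed on the GPU at the same moment.
    """
    return guest.transfer_s + guest.exec_s <= host.transfer_s


def is_safe_package(a: Job, b: Job) -> bool:
    """Safe if either job completes inside the other's transfer window."""
    return fits_in_transfer_window(a, b) or fits_in_transfer_window(b, a)


def pick_partner(host: Job, queue: Sequence[Job]) -> Optional[Job]:
    """Return the first queued job that can safely share the GPU with `host`."""
    for candidate in queue:
        if fits_in_transfer_window(candidate, host):
            return candidate
    return None


long_job = Job("long", transfer_s=100.0, exec_s=500.0)
short_job = Job("short", transfer_s=10.0, exec_s=60.0)
mid_job = Job("mid", transfer_s=30.0, exec_s=90.0)

partner = pick_partner(long_job, [mid_job, short_job])
print(partner.name if partner else None)
```

In this example `short_job` needs 70 seconds in total, which fits inside the 100-second transfer window of `long_job`, while `mid_job` needs 120 seconds and is skipped. A real scheduler would also have to account for prediction error and for jobs that start at different times, which this sketch ignores.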