Abstract: Two notable characteristics of modern GPU-accelerated HPC clusters are: (1) they increasingly run deep learning (DL) model-training workloads, and (2) they consist of multiple generations of GPUs, i.e., they are heterogeneous. However, existing work on GPU cluster scheduling for DL workloads has not addressed the GPU multi-generation problem. We propose DASH, a GPU cluster scheduler designed to optimally match different DL workloads to GPU types in a multi-generational GPU environment. By leveraging the execution characteristics of co-scheduled DL workloads, DASH can improve the average job runtime by 17% and the average job completion time by 14% compared to a traditional heterogeneity-unaware job scheduler.