DASH: Scheduling Deep Learning Workloads on Multi-Generational GPU-Accelerated Clusters

Baolin Li, Tirthak Patel, Vijay Gadepally, Karen Gettings, Siddharth Samsi, Devesh Tiwari

Published: 2022, Last Modified: 08 Mar 2025, Venue: HPEC 2022, License: CC BY-SA 4.0

Abstract: Two notable characteristics of modern GPU-accelerated HPC clusters are: (1) they increasingly run deep learning (DL) model-training workloads, and (2) they consist of multiple generations of GPUs, i.e., they are heterogeneous. However, existing work on GPU cluster scheduling for DL workloads has not addressed the GPU multi-generation problem. We propose DASH, a GPU cluster scheduler designed to optimally match different DL workloads to GPU types in a multi-generational GPU environment. By leveraging execution characteristics of co-scheduled DL workloads, DASH can improve the average job runtime by 17% and the average job completion time by 14% compared to a traditional heterogeneity-unaware job scheduler.