Lucid: A Non-intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs

Qinghao Hu, Meng Zhang, Peng Sun, Yonggang Wen, Tianwei Zhang

Published: 01 Jan 2023, Last Modified: 12 May 2023ASPLOS (2) 2023Readers: Everyone

Abstract: While recent deep learning workload schedulers exhibit excellent performance, it is arduous to deploy them in practice due to some substantial defects, including inflexible intrusive manner, exorbitant integration and maintenance cost, limited scalability, as well as opaque decision processes. Motivated by these issues, we design and implement Lucid, a non-intrusive deep learning workload scheduler based on interpretable models. It consists of three innovative modules. First, a two-dimensional optimized profiler is introduced for efficient job metric collection and timely debugging job feedback. Second, Lucid utilizes an indolent packing strategy to circumvent interference. Third, Lucid orchestrates resources based on estimated job priority values and sharing scores to achieve efficient scheduling. Additionally, Lucid promotes model performance maintenance and system transparent adjustment via a well-designed system optimizer. Our evaluation shows that Lucid reduces the average job completion time by up to 1.3× compared with state-of-the-art preemptive scheduler Tiresias. Furthermore, it provides explicit system interpretations and excellent scalability for practical deployment.

0 Replies