Abstract: Deep learning accelerators, as the cornerstone of machine learning systems, expedite neural network training and inference. With their computational power escalating annually, multi-tasking becomes imperative to harness their full potential. However, like other parallel processing systems, deep learning accelerators confront performance fluctuations that can result in unpredictable kernel latency, suboptimal resource utilization, and exacerbated tail latency. This paper identifies unfairness in stream-level scheduling as the root cause of these performance fluctuations. To mitigate this issue, we introduce Leaf, a novel learning-based stream-level fair scheduling method that dynamically learns and adapts scheduling policies by leveraging feature vectors extracted from enqueued kernels and accelerator status. To address the scalability and latency constraints inherent in stream-level scheduling, we devise a scalable scheduling framework and an online scheduler-switching mechanism for Leaf. A preliminary implementation on a commercial-grade deep learning accelerator demonstrates that Leaf reduces kernel latency variation by $10 \sim 20$ times, sustains high and stable resource utilization, and markedly decreases workload runtime by mitigating tail latency, outperforming the accelerator's native scheduler.