DProbe: Profiling and Predicting Multi-tenant Deep Learning Workloads for GPU Resource Scaling

Zechun Zhou, Jingwei Sun, Hengquan Mei, Peng Sun, Guangzhong Sun

Published: 2024, Last Modified: 18 Mar 2026. Venue: Euro-Par (1) 2024. License: CC BY-SA 4.0

Abstract: The surge in deep learning services has driven the development of large-scale GPU datacenters that serve the computational demands of multi-tenant deep learning workloads. These datacenters partition resources into virtual clusters to maintain isolation across product groups. Dynamically adjusting resource allocation across virtual clusters can improve resource utilization. However, efficient GPU resource scaling hinges on accurately forecasting resource demand trends, a task complicated by significant variation in GPU utilization among deep learning instances. To address this issue, we propose DProbe, a system that predicts resource demand trends within virtual clusters through fine-grained profiling of multi-tenant deep learning workloads. First, DProbe employs a job profiler that integrates model-specific attributes with runtime hardware metrics to model the performance of deep learning instances. Resource demands are then estimated through a multi-level approach that considers the distribution of instances across levels of GPU utilization. In addition, DProbe incorporates a multi-task trend predictor that anticipates future resource demand trends from historical traces. DProbe’s predictions enable efficient resource scaling across virtual clusters. We evaluate DProbe on production traces under five scheduling policies, where it reduces the average job queuing delay by 22.4% to 50.7%.
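The multi-level estimation step mentioned in the abstract can be illustrated with a minimal sketch. This is not DProbe's actual method, which the abstract does not specify; the level names, utilization thresholds, per-level weights, and the function names `level_of` and `estimate_demand` are all hypothetical, chosen only to show how grouping instances by GPU utilization level could feed a demand estimate.

```python
from collections import Counter

# Hypothetical utilization levels: (name, upper utilization bound, assumed weight).
# The weight is an illustrative guess at how much of a requested GPU an
# instance at that level effectively needs; none of these values are from the paper.
LEVELS = [("low", 0.3, 0.25), ("medium", 0.7, 0.5), ("high", 1.0, 1.0)]


def level_of(util):
    """Map a predicted GPU utilization in [0, 1] to a level name."""
    if not 0.0 <= util <= 1.0:
        raise ValueError(f"utilization out of range: {util}")
    for name, upper, _ in LEVELS:
        if util <= upper:
            return name


def estimate_demand(instances):
    """Estimate effective GPU demand for one virtual cluster.

    instances: iterable of (gpus_requested, predicted_utilization) pairs.
    Returns (requested GPUs per level, weighted demand estimate).
    """
    weights = {name: weight for name, _, weight in LEVELS}
    per_level = Counter()
    for gpus, util in instances:
        per_level[level_of(util)] += gpus
    demand = sum(per_level[name] * weights[name] for name in weights)
    return dict(per_level), demand


if __name__ == "__main__":
    # Made-up instances: (GPUs requested, predicted utilization).
    cluster = [(4, 0.2), (8, 0.9), (2, 0.5), (4, 0.1)]
    print(estimate_demand(cluster))
```

In this sketch, lightly utilized instances contribute less to the cluster's estimated demand than their raw GPU request would suggest, which is one plausible reading of accounting for the distribution of instances across utilization levels.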

External IDs: dblp:conf/europar/ZhouSMSS24