Latency-Aware GPU Scheduling for DL Inference Tasks with Internal Resource Partitioning Strategy

Taewoo Kim, Changha Lee, Chan-Hyun Youn

Published: 2024, Last Modified: 02 Mar 2026. Venue: ICTC 2024. License: CC BY-SA 4.0

Abstract: With advancements in architecture and open-source foundation models that extract general knowledge, many developers are leveraging deep learning (DL) as application services. To conduct computation-intensive matrix operations, hardware resources with numerous parallel cores have also seen significant progress. Due to the high operational costs of such resources, efficient processing of DL tasks has become crucial for computing resource providers. Most studies on resource management for DL inference tasks focus on adjusting batch sizes to control the efficiency of a resource. Unlike training tasks, however, inference processing uses relatively few internal resources, such as parallel cores and memory units. Consequently, a single inference task often fails to fully utilize the resource. Moreover, latency prediction models that assume latency is linearly proportional to batch size still have limitations with small batch sizes and diverse architectures. In this paper, we introduce a novel GPU scheduler that improves processing efficiency by partitioning internal resources and executing multiple tasks simultaneously on a single GPU. Our latency model considers the internal kernel operation distribution, enabling precise prediction even at small batch sizes. By adjusting both the batch size and the usage ratio of internal resources, the proposed scheduler can effectively achieve the coexistence of multi-tenant DL tasks. Experimental evaluations demonstrate that our approach achieves higher throughput and power efficiency compared to conventional scheduling mechanisms.
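To make the idea of jointly choosing a batch size and an internal resource share concrete, the following is a minimal illustrative sketch, not the authors' latency model or scheduler. Every name in it (`Kernel`, `task_latency_ms`, `plan_two_tasks`), the per-kernel cost formula, and all numeric values are assumptions invented for illustration. It assumes each kernel has a fixed overhead and can only keep a limited fraction of the GPU busy per sample, so that at small batch sizes a larger GPU share does not reduce latency.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Kernel:
    """Hypothetical per-kernel cost parameters (illustrative values only)."""
    overhead_ms: float  # fixed launch cost, independent of batch size
    work_ms: float      # compute time per sample on the whole GPU
    width: float        # fraction of the GPU one sample can keep busy


def kernel_latency_ms(k: Kernel, batch: int, frac: float) -> float:
    # The kernel cannot use more of the GPU than it was given (frac),
    # nor more than its batch can actually occupy (batch * width).
    usable = min(frac, batch * k.width, 1.0)
    return k.overhead_ms + k.work_ms * batch / usable


def task_latency_ms(kernels, batch: int, frac: float) -> float:
    # Task latency is the sum over its kernels, so it is not simply
    # linear in batch size when kernels have different widths.
    return sum(kernel_latency_ms(k, batch, frac) for k in kernels)


def plan_two_tasks(tasks, slos_ms, batches=(1, 2, 4, 8, 16, 32), steps=10):
    """Exhaustively search batch sizes and a two-way GPU split.

    Returns (batch_sizes, fractions, latencies) maximizing total
    throughput subject to each task's latency bound, or None.
    """
    best, best_tput = None, 0.0
    for i in range(1, steps):
        fracs = (i / steps, (steps - i) / steps)
        for bs in product(batches, repeat=2):
            lats = [task_latency_ms(t, b, f)
                    for t, b, f in zip(tasks, bs, fracs)]
            if any(lat > slo for lat, slo in zip(lats, slos_ms)):
                continue  # this configuration violates a latency bound
            tput = sum(b / lat for b, lat in zip(bs, lats))  # samples per ms
            if tput > best_tput:
                best, best_tput = (bs, fracs, lats), tput
    return best


# Two made-up models sharing one GPU, with made-up latency bounds.
small_net = [Kernel(0.05, 0.02, 0.1), Kernel(0.05, 0.04, 0.05)]
large_net = [Kernel(0.05, 0.2, 0.25), Kernel(0.05, 0.1, 0.2)]
plan = plan_two_tasks([small_net, large_net], slos_ms=(20.0, 30.0))
```

Under these assumptions, `task_latency_ms(small_net, 1, 1.0)` equals `task_latency_ms(small_net, 1, 0.5)`: a single small-batch task cannot use the extra half of the GPU, which is the kind of slack a partitioning scheduler can hand to a co-located task.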

External IDs: dblp:conf/ictc/0003LY24