DeInfer: A GPU resource allocation algorithm with spatial sharing for near-deterministic inferring tasks
Abstract: In artificial intelligence applications, training models on GPUs has been widely studied, while inference requirements are often neglected. In some scenarios, it is critical to finish a deep learning inference (DLI) task and return the response in time, e.g., anomaly detection in AIOps or QoE (Quality of Experience) assurance for customers. However, GPU inference faces the following challenges: 1) the interference among inference tasks sharing a GPU is not well studied, or even not considered, which may cause a surge in inference latency due to hardware contention; 2) the deadline miss rate induced by the arrival rate, which often fluctuates significantly in real-world cases, is not explicitly accounted for. Therefore, when sharing GPU resources, both the interference among tasks and their arrival rates should be carefully modeled to decrease the deadline miss rate. To tackle these issues, we propose an algorithm, DeInfer, in the following manner: 1) we identify the key factors that lead to interference, conduct a systematic study, and develop a highly accurate interference prediction algorithm based on the random forest algorithm, achieving a fourfold improvement over state-of-the-art interference prediction algorithms; 2) we use queueing theory to model the randomness of the arrival process and propose a GPU resource allocation algorithm, which reduces the deadline miss rate by an average of over 30%.
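To illustrate the first contribution, the following is a minimal sketch of random-forest-based interference prediction. The feature set (per-task SM utilization, memory bandwidth, and cache usage of two co-located tasks) and the synthetic slowdown data are assumptions for illustration only; the paper's actual profiling features and training data are not shown here.

```python
# Sketch: predict the latency slowdown of co-located inference tasks
# with a random forest, trained on (hypothetical) profiling features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic placeholder data: each row profiles two co-located tasks
# [sm_util_a, mem_bw_a, cache_a, sm_util_b, mem_bw_b, cache_b];
# the target is the observed latency slowdown of task A.
X = rng.uniform(size=(1000, 6))
y = 1.0 + 2.0 * (X[:, 1] * X[:, 4]) + 0.5 * (X[:, 0] * X[:, 3])  # toy slowdown model

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Predict the slowdown of a new candidate co-location before admitting it.
candidate = rng.uniform(size=(1, 6))
print("predicted slowdown:", model.predict(candidate)[0])
```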
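For the second contribution, a minimal sketch of how a queueing model can drive GPU share allocation is given below, assuming a simple M/M/1 queue: with Poisson arrivals of rate lam and exponential service of rate mu, the sojourn time is exponential with rate (mu - lam), so the deadline miss probability is exp(-(mu - lam) * deadline). The throughput model mu_of_shares is a hypothetical placeholder; DeInfer's actual allocation policy is more elaborate.

```python
# Sketch: pick the smallest GPU share allocation whose predicted
# deadline miss rate (under an M/M/1 model) meets the target.
import math

def miss_probability(lam: float, mu: float, deadline: float) -> float:
    """Deadline miss probability for a stable M/M/1 queue (mu > lam)."""
    if mu <= lam:
        return 1.0  # unstable queue: the deadline is missed almost surely
    return math.exp(-(mu - lam) * deadline)

def min_shares(lam, deadline, target_miss, mu_of_shares, max_shares=100):
    """Smallest share allocation whose predicted miss rate meets the target."""
    for shares in range(1, max_shares + 1):
        if miss_probability(lam, mu_of_shares(shares), deadline) <= target_miss:
            return shares
    return None  # infeasible within the share budget

# Hypothetical linear throughput model: 2 requests/s per GPU share.
shares = min_shares(lam=50.0, deadline=0.1, target_miss=0.01,
                    mu_of_shares=lambda s: 2.0 * s)
print("allocated GPU shares:", shares)
```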