TOP: Task-Based Operator Parallelism for Asynchronous Deep Learning Inference on GPU

Published: 01 Jan 2025 · Last Modified: 06 Feb 2025 · IEEE Trans. Parallel Distributed Syst. 2025 · License: CC BY-SA 4.0
Abstract: Current deep learning compilers have made significant strides in optimizing computation graphs for single- and multi-model scenarios, but they lack optimizations specific to asynchronous multi-task inference systems. In such systems, tasks arrive dynamically, so each model's inference progress diverges, rendering traditional optimization strategies based solely on the original computation graph suboptimal or even invalid. Furthermore, existing operator scheduling methods do not account for parallel task pipelines involving the same model, even though such pipelines present additional opportunities for optimization. We therefore propose Task-based Operator Parallelism (TOP). TOP models the impact of task arrival patterns on each model's inference progress and leverages the multi-agent reinforcement learning algorithm MADDPG to cooperatively optimize the task launcher and the model scheduler, generating an optimal pair of dequeue frequency and computation graph. The objective of TOP is to improve resource utilization, increase throughput, and allocate resources judiciously to prevent task backlog. To expedite the optimization process, we introduce a novel stage partition method using a GNN-based Policy Gradient (GPG) algorithm. Extensive experiments on various devices demonstrate the efficacy of TOP: it outperforms the state of the art in operator scheduling for both single- and multi-model task processing scenarios. Benefiting from TOP, the throughput of a single model can be significantly enhanced by increasing its concurrency or batch size, thereby achieving self-acceleration.
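To make the cooperative two-agent formulation in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): a "task launcher" agent proposing a dequeue frequency and a "model scheduler" agent proposing an encoding of a computation-graph schedule, both updated MADDPG-style against a shared centralized critic. All state/action dimensions, network sizes, and the reward signal are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

STATE_DIM = 16   # assumed joint observation: queue depth, GPU utilization, etc.
FREQ_DIM = 1     # launcher action: dequeue frequency
SCHED_DIM = 8    # scheduler action: continuous encoding of a graph schedule


class Actor(nn.Module):
    """Per-agent policy network mapping an observation to an action."""

    def __init__(self, state_dim: int, action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


class CentralizedCritic(nn.Module):
    """Scores the joint (state, launcher action, scheduler action) tuple."""

    def __init__(self, state_dim: int, joint_action_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + joint_action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, state, freq_action, sched_action):
        return self.net(torch.cat([state, freq_action, sched_action], dim=-1))


launcher = Actor(STATE_DIM, FREQ_DIM)     # proposes dequeue frequency
scheduler = Actor(STATE_DIM, SCHED_DIM)   # proposes computation-graph schedule
critic = CentralizedCritic(STATE_DIM, FREQ_DIM + SCHED_DIM)
actor_opt = torch.optim.Adam(
    list(launcher.parameters()) + list(scheduler.parameters()), lr=1e-3)

# One illustrative policy-improvement step on a random batch of observations.
# In the paper's terms, the critic would reflect throughput and backlog;
# here it is just an untrained network used to show the update structure.
state = torch.randn(32, STATE_DIM)
freq = launcher(state)
sched = scheduler(state)
actor_loss = -critic(state, freq, sched).mean()  # both actors maximize the shared value
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```

The cooperative element is that both actors are trained against the same centralized critic over the joint action, so the dequeue frequency and the graph schedule are optimized as a pair rather than independently.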