Irina: Accelerating DNN Inference with Efficient Online Scheduling

APNet 2020 (modified: 03 Nov 2022)
Abstract: DNN inference is becoming prevalent in many real-world applications. Current machine learning frameworks usually schedule inference tasks with the goal of optimizing throughput under predictable workloads and task arrival patterns. Yet, inference workloads are becoming more dynamic, with bursty queries generated by video analytics pipelines that run expensive inference on only a fraction of video frames. It is therefore imperative to optimize the completion time of these unpredictable queries and improve the customer experience. We propose the preliminary design of the first online inference task scheduling system, called Irina, that takes completion time under unpredictable workloads as its primary objective. Irina augments the design space of inference task scheduling with three new strategies, namely batching, stacking, and preemption, in order to schedule tasks more flexibly and reduce overall latency. Simulation results with empirical inference execution data show that Irina can improve average task completion time by 1.3x–2.5x over TensorFlow Serving scheduling.
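To illustrate why batching bursty queries can shrink average completion time (the abstract's primary objective), here is a minimal toy simulation. It is not Irina's actual algorithm, and the latency numbers (10 ms standalone per task, 2 ms marginal cost per extra task in a batch) are hypothetical; the point is only that sharing fixed per-invocation overhead across a batch lets every task in a burst finish sooner than strict one-at-a-time execution.

```python
# Toy comparison (hypothetical latency model, not Irina's design):
# sequential execution vs. batched execution of a burst of identical
# DNN inference queries that all arrive at time 0.

def sequential_completion_times(n, per_task=10.0):
    # One task at a time: task i finishes after (i + 1) executions.
    return [per_task * (i + 1) for i in range(n)]

def batched_completion_times(n, base=10.0, marginal=2.0):
    # One batch of n tasks: all finish together when the batch completes.
    # Assumed cost model: fixed base latency plus a small marginal cost
    # for each additional task in the batch.
    batch_latency = base + marginal * (n - 1)
    return [batch_latency] * n

n = 8  # a burst of 8 simultaneous queries
seq = sequential_completion_times(n)
bat = batched_completion_times(n)
avg_seq = sum(seq) / n  # 45.0 ms
avg_bat = sum(bat) / n  # 24.0 ms
print(f"sequential avg: {avg_seq} ms, batched avg: {avg_bat} ms")
```

Under this made-up cost model the batched schedule roughly halves average completion time for the burst; the actual speedups reported in the paper (1.3x–2.5x over TensorFlow Serving) come from its own simulations with empirical execution data.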
