IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Venue: JSYS 2024 March Papers

Authors: Submission1 Authors

Date: 22 Feb 2024 (modified: 19 Apr 2024)

Readers: Everyone

License: CC BY 4.0

Keywords: Inference Serving Systems, Inference Pipelines, Autoscaling, Machine Learning

Abstract: Optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently consider only one of them. The real challenge, however, lies in reconciling all three. To address it, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task that vary in resource requirements, latency, and accuracy. Using Integer Programming, IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize cost, and meet user-defined latency Service Level Agreements (SLAs). It supports multi-objective settings that achieve different trade-offs between accuracy and cost while remaining adaptable to varying workloads and dynamic traffic patterns. Because it navigates a wider variety of configurations, IPA achieves better trade-offs between cost and accuracy than existing methods. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase.
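To make the decision space described in the abstract concrete, the following is a minimal sketch of choosing a model variant, batch size, and replica count per pipeline stage under a latency SLA. It is not the authors' formulation: it swaps the Integer Programming solver for plain exhaustive search, and the linear latency model, the summed per-stage accuracy, the core-count cost, the weights `alpha` and `beta`, and all names and numbers (`Variant`, `choose_config`, `EXAMPLE_STAGES`) are assumptions made for illustration. Queueing delay is ignored.

```python
from dataclasses import dataclass
from itertools import product
from math import ceil


@dataclass(frozen=True)
class Variant:
    """Hypothetical model variant; all fields are illustrative assumptions."""
    name: str
    accuracy: float          # assumed accuracy score in [0, 1]
    base_latency: float      # assumed seconds to process a batch of size 1
    per_item_latency: float  # assumed extra seconds per additional batch item
    cores: int               # assumed CPU cores per replica


def batch_latency(variant, batch):
    # Assumed linear latency model: one batch takes base + slope * (batch - 1).
    return variant.base_latency + variant.per_item_latency * (batch - 1)


def choose_config(stages, load_rps, sla, alpha, beta, batch_sizes=(1, 2, 4, 8)):
    """Exhaustively search (variant, batch) per stage; return the best feasible
    configuration as a dict, or None if no configuration meets the SLA."""
    options = [[(v, b) for v in stage for b in batch_sizes] for stage in stages]
    best = None
    for combo in product(*options):
        # End-to-end latency is taken as the sum of per-stage batch latencies.
        latency = sum(batch_latency(v, b) for v, b in combo)
        if latency > sla:
            continue
        # One replica serves batch / batch_latency requests per second, so
        # enough replicas are provisioned to absorb the offered load.
        replicas = [max(1, ceil(load_rps * batch_latency(v, b) / b))
                    for v, b in combo]
        cost = sum(r * v.cores for r, (v, _) in zip(replicas, combo))
        accuracy = sum(v.accuracy for v, _ in combo)  # assumed aggregation
        score = alpha * accuracy - beta * cost        # weighted objective
        if best is None or score > best["score"]:
            best = {
                "score": score,
                "variants": [v.name for v, _ in combo],
                "batches": [b for _, b in combo],
                "replicas": replicas,
                "cost": cost,
                "latency": latency,
            }
    return best


# Made-up two-stage pipeline with a small and a large variant per stage.
EXAMPLE_STAGES = [
    [Variant("small", 0.60, 0.05, 0.01, 1), Variant("large", 0.80, 0.20, 0.04, 4)],
    [Variant("small", 0.70, 0.04, 0.01, 1), Variant("large", 0.90, 0.15, 0.03, 2)],
]
```

Calling `choose_config(EXAMPLE_STAGES, load_rps=20, sla=10.0, alpha=1.0, beta=0.0)` favors accuracy alone, while raising `beta` shifts the choice toward cheaper configurations. Exhaustive search grows exponentially with the number of stages, which is one reason a solver-based formulation is attractive for larger pipelines.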

Area: Systems for ML and ML for systems

Type: Solution

Revision: Yes

Previous Version: https://openreview.net/forum?id=25IIAWja31

Submission Number: 1
