Cost-Efficient Serverless Inference Serving with Joint Batching and Multi-Processing

Published: 01 Jan 2023 · Last Modified: 06 Aug 2024 · APSys 2023 · License: CC BY-SA 4.0
Abstract: With the emergence of machine learning, many commercial companies increasingly use machine learning inference systems as backend services to improve their products. Serverless computing is a modern paradigm that provides auto-scaling, event-driven services, making it well suited to domains such as video stream analysis, IoT serving, and machine learning applications. Its flexible scaling is adept at handling the burstiness of ML workloads. However, despite this compatibility with ML inference tasks, the cost of serverless inference systems remains relatively high compared with traditional serving paradigms, primarily because the CPU resources offered by serverless platforms are under-utilized. To tackle this challenge, we design and deploy a serverless inference serving system that combines batching and multi-processing to improve cost efficiency, and that applies a change-point detection algorithm to manage bursty workloads, optimizing resource usage and lowering cost. We employ an Amazon EC2 server to package requests and to run the core Bayesian Optimization algorithm without any prior information. The preliminary system, implemented on AWS Lambda, significantly reduces expenses, saving up to 62% compared to the original serverless inference system.
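To make the joint batching and multi-processing idea concrete, here is a minimal sketch of a Lambda handler that splits a packed batch across worker processes. The handler name, the `event["batch"]` wire format, the `NUM_WORKERS` knob, and `model_predict` are all illustrative assumptions, not the paper's actual implementation; note that `multiprocessing.Pipe`/`Process` work on Lambda while `Queue`/`Pool` do not, since Lambda lacks `/dev/shm`.

```python
# Hedged sketch: shard one batched Lambda invocation across processes
# so the function's vCPUs are not left idle during inference.
import json
import os
from multiprocessing import Pipe, Process

NUM_WORKERS = int(os.environ.get("NUM_WORKERS", "2"))  # assumed config knob


def model_predict(x):
    return x  # placeholder: replace with the real model's inference call


def _infer(conn, inputs):
    # Each worker process runs the model on its shard of the batch.
    conn.send([model_predict(x) for x in inputs])
    conn.close()


def handler(event, context):
    # The EC2 front-end packs several requests into one invocation,
    # so event["batch"] holds a list of inputs (assumed wire format).
    batch = event["batch"]
    shard = max(1, len(batch) // NUM_WORKERS)
    procs, conns = [], []
    for i in range(0, len(batch), shard):
        parent, child = Pipe()
        p = Process(target=_infer, args=(child, batch[i:i + shard]))
        p.start()
        procs.append(p)
        conns.append(parent)
    # Drain results before joining to avoid blocking on large payloads.
    results = [r for c in conns for r in c.recv()]
    for p in procs:
        p.join()
    return {"statusCode": 200, "body": json.dumps(results)}
```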
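The abstract does not name the change-point detector used for bursty workloads; a two-sided CUSUM over the request arrival rate is one standard online choice and serves here as a stand-in. The `target`, `drift`, and `threshold` values are illustrative.

```python
# Hedged sketch: flag a change point in request arrival rate so the
# batching/resource policy can be re-tuned when a burst begins or ends.
class Cusum:
    def __init__(self, target, drift=0.5, threshold=5.0):
        self.target = target        # expected requests/sec under normal load
        self.drift = drift          # slack before accumulating evidence
        self.threshold = threshold  # alarm level
        self.pos = 0.0
        self.neg = 0.0

    def update(self, rate):
        # Accumulate deviations above / below the target rate.
        self.pos = max(0.0, self.pos + rate - self.target - self.drift)
        self.neg = max(0.0, self.neg - rate + self.target - self.drift)
        if self.pos > self.threshold or self.neg > self.threshold:
            self.pos = self.neg = 0.0  # reset after raising an alarm
            return True                # change point detected
        return False


detector = Cusum(target=100.0)
for rate in [98, 102, 99, 180, 185, 190]:  # toy per-second arrival rates
    if detector.update(rate):
        print(f"burst detected at rate {rate}; re-tune batching policy")
```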
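Finally, a rough sketch of how a prior-free Bayesian Optimization loop on the EC2 side might pick a cost-minimizing configuration such as the Lambda memory size. The objective `measure_cost`, the candidate grid, and the lower-confidence-bound acquisition are assumptions for illustration; the paper's actual optimizer and search space are not specified in this abstract.

```python
# Hedged sketch: GP-based Bayesian Optimization over Lambda memory sizes,
# started from random probes (i.e., "without any prior information").
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor


def measure_cost(memory_mb):
    # Stand-in for deploying a config and measuring $/request on Lambda.
    return (memory_mb - 1770) ** 2 / 1e6 + np.random.normal(0, 0.05)


candidates = np.arange(128, 3009, 64).reshape(-1, 1)  # memory sizes (MB)
X, y = [], []

# Seed with two random probes, then let the GP guide the search.
rng = np.random.default_rng(0)
for x0 in rng.choice(candidates.ravel(), size=2, replace=False):
    X.append([x0]); y.append(measure_cost(x0))

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    # Lower confidence bound: favor configs that are cheap or unexplored.
    x_next = candidates[np.argmin(mu - 2.0 * sigma)][0]
    X.append([x_next]); y.append(measure_cost(x_next))

best = X[int(np.argmin(y))][0]
print(f"estimated cheapest memory size: {best} MB")
```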
