Optimus: Warming Serverless ML Inference via Inter-Function Model Transformation

Published: 01 Jan 2024, Last Modified: 30 Sept 2024, EuroSys 2024, CC BY-SA 4.0
Abstract: Serverless ML inference is an emerging cloud computing paradigm for low-cost, easy-to-manage inference services. In serverless ML inference, each call is executed in a container; however, the cold start of containers results in long inference delays. Unfortunately, most existing approaches remain ineffective because they still load models into containers from scratch, which our observations identify as the main bottleneck. This paper therefore proposes Optimus, a low-latency serverless ML inference system built on a new container management scheme. Our key insight is that loading a new model can be significantly accelerated by reusing a structurally similar model that already resides in a warm but idle container. We thus develop the idea of inter-function model transformation for serverless ML inference: it operates on models inside containers at the finer granularity of individual operations, provides a set of in-container meta-operators for transforming both CNN and transformer models, and uses a linear-complexity scheduling algorithm to find a low-cost transformation strategy. Our evaluations on thousands of models show that Optimus reduces inference latency by 24.00% ~ 47.56% on both simulated and real-world workloads compared to state-of-the-art work.
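To make the transformation idea concrete, below is a minimal, hypothetical sketch in PyTorch-style Python. It is not the authors' implementation: the `transform` function, the `target_factory` and `load_tensor` callbacks, and the simple reuse-vs-load rule (keep a warm parameter whenever its name and shape match and its values are assumed identical, e.g., a shared pretrained backbone; otherwise fetch only that tensor from storage) are illustrative assumptions. It only conveys the gist that a warm container's resident model is patched toward the requested model rather than loaded from scratch.

```python
# Illustrative sketch only: patch the model already resident in a warm
# container toward the requested model, so most tensors never need to be
# re-read from storage or re-allocated. `target_factory` (builds the
# requested architecture without weights) and `load_tensor` (fetches one
# named tensor from model storage) are hypothetical helpers, not Optimus APIs.
import torch
import torch.nn as nn


def transform(warm_model: nn.Module, target_factory, load_tensor) -> nn.Module:
    """Reuse matching parameters of the warm model; load only what differs."""
    warm_params = dict(warm_model.named_parameters())
    target = target_factory()  # architecture skeleton of the requested model

    with torch.no_grad():
        for name, param in target.named_parameters():
            src = warm_params.get(name)
            if src is not None and src.shape == param.shape:
                # "reuse" step (illustrative): keep the tensor that is already
                # in the warm container's memory, assumed to hold the same
                # values (e.g., a shared pretrained backbone).
                param.data = src.data
            else:
                # "load" fallback: only the differing tensors touch storage.
                param.copy_(load_tensor(name))
    return target
```

In the paper's actual design, the in-container meta-operators manipulate the model at the granularity of operations (for both CNNs and transformers), and the linear-complexity scheduler decides which warm model to transform so that the overall transformation cost stays low; the sketch above glosses over both of these parts.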