Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
Abstract: With the ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneous infrastructure and do not account for the more complex, tiered computing infrastructure that includes edge devices, local hubs, edge datacenters, and cloud datacenters. Meanwhile, recent AutoML efforts have provided viable solutions for model compression, pruning, and quantization in heterogeneous environments; for a given machine learning model, one can now easily find, or even generate, a series of model variants with different tradeoffs between accuracy and efficiency.
We design and implement JellyBean, a system for serving and optimizing machine learning inference workflows on heterogeneous infrastructures. Given service-level objectives (e.g., throughput, accuracy), JellyBean picks the most cost-efficient models that meet the accuracy target and decides how to deploy them across different tiers of the infrastructure. Evaluations show that JellyBean reduces the total serving cost of visual question answering by up to 58%, and of vehicle tracking from the NVIDIA AI City Challenge by up to 36%, compared with state-of-the-art model selection and worker assignment solutions. JellyBean also outperforms prior ML serving systems (e.g., Spark on the cloud) by up to 5x in serving cost.