TL;DR: Simplifying inference algorithms and embedding them into product code lowers inference latency by 1.3x, CPU resources by 30%, and network communication between application front-end and ML back-end by 50%.
Abstract: Many ML applications and products train on medium amounts of input data but get bottlenecked in real-time inference. When implementing ML systems, conventional wisdom favors segregating ML code into services queried by product code via Remote Procedure Call (RPC) APIs. This approach clarifies the overall software architecture and simplifies product code by abstracting away ML internals. However, the separation adds network latency and entails additional CPU overhead. Hence, we simplify inference algorithms and embed them into the product code to reduce network communication. For public data sets and a high-performance real-time platform that deals with tabular data, we show that over half of the inputs are often amenable to such optimization, while the remainder can be handled by the original model. By applying our optimization with AutoML to both training and inference, we reduce inference latency by 1.3x, CPU resources by 30%, and network communication between application front-end and ML back-end by about 50% for a commercial end-to-end ML platform that serves millions of real-time decisions per second. The crucial role of AutoML is in configuring first-stage inference and balancing the two stages.
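Illustrative sketch (not from the paper): the two-stage idea in the abstract can be pictured as a cheap first-stage model evaluated inside product code, with a remote call made only when that model is not confident. The code below is a minimal sketch under assumptions that the abstract does not state: the first stage is taken to be a logistic model, routing is decided by two hypothetical confidence thresholds `low` and `high`, and `fallback` is a hypothetical stand-in for the RPC to the original model. In the paper, AutoML configures the first stage and balances the two stages; here the parameters are simply passed in by hand.

```python
import math
from typing import Callable, Sequence, Tuple


def first_stage_score(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Cheap embedded model (assumed logistic); returns a probability."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    # Numerically stable sigmoid: avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def two_stage_predict(features: Sequence[float],
                      weights: Sequence[float],
                      bias: float,
                      low: float,
                      high: float,
                      fallback: Callable[[Sequence[float]], float],
                      ) -> Tuple[float, str]:
    """Answer locally when the first stage is confident, else defer.

    `low` and `high` are hypothetical thresholds: scores outside the
    open interval (low, high) are treated as confident and served
    without any network call.
    """
    p = first_stage_score(features, weights, bias)
    if p <= low or p >= high:
        return p, "embedded"
    # Uncertain input: hand it to the original model (an RPC in production).
    return fallback(features), "fallback"


if __name__ == "__main__":
    calls = []

    def fake_rpc(x: Sequence[float]) -> float:
        # Stand-in for the remote model; records each call and returns a fixed score.
        calls.append(list(x))
        return 0.7

    w, b = [2.0, -1.0], 0.0
    print(two_stage_predict([3.0, 0.0], w, b, 0.1, 0.9, fake_rpc))
    print(two_stage_predict([0.1, 0.1], w, b, 0.1, 0.9, fake_rpc))
    print("remote calls:", len(calls))
```

In this sketch, the fraction of inputs that fall outside the threshold interval determines how much network traffic is avoided, so widening or narrowing the interval trades first-stage coverage against the accuracy of locally served predictions.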
Keywords: Efficient Inference, Green AutoML, Joint AutoML, End-to-End Learning for Tabular Data, Machine Learning, Multi-stage Inference, Inference Optimization
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Code And Dataset Supplement: zip
Steps For Environmental Footprint Reduction During Development: This work required a modest amount of CPU resources, and we made an effort not to rerun experiments unnecessarily. The goal of this work, however, is to reduce net CPU resources, so deploying our approach in a production environment would reduce the overall environmental footprint.
CPU Hours: 0
GPU Hours: 0
TPU Hours: 0