Abstract: Real-time deep learning inference serving systems often demand prohibitive resources and must satisfy diverse user requirements. Existing designs of inference serving systems focus mainly on computation resource efficiency and largely ignore the trade-off between the computation and bandwidth resources required. Suboptimal resource utilization usually leads to substantial wasted serving cost. In this paper, we tackle the dual challenge of the computation-bandwidth trade-off and cost-effectiveness by proposing sys, an efficient joint Adaptive model and Adaptive data deep learning serving solution across geo-datacenters. Inspired by the insight that there is a trade-off between computational cost and bandwidth cost in achieving the same accuracy, we design a real-time inference serving framework that selectively places different "versions" of the deep learning models at different geo-locations and schedules different data sample versions to be sent to those model versions for inference. The goal is to minimize the total serving cost while meeting the latency and accuracy demands of the serving requests. We formulate a joint placement and serving problem and propose an efficient approximation algorithm that solves it with a theoretical performance guarantee. We deploy sys on Amazon EC2 for experiments, which show that sys achieves a 30%-50% reduction in serving cost compared to baselines under the same latency and accuracy requirements.