Keywords: low-capacity model, large-scale prediction, efficient inference, hybrid networks, routing nets, coverage and latency, FLOPs
Abstract: Although deep neural networks (DNNs) achieve state-of-the-art accuracy on large-scale and fine-grained prediction tasks, they are high-capacity models and often cannot be deployed on edge devices. As such, two distinct paradigms have emerged in parallel: 1) edge-device inference for low-level tasks, and 2) cloud-based inference for large-scale tasks. We propose a novel hybrid option that marries these extremes and seeks to bring the latency and computational cost benefits of edge-device inference to tasks currently deployed in the cloud. Our proposed method is an end-to-end approach that architects and trains two networks in tandem. The first is a low-capacity network that can be deployed on an edge device, whereas the second is a high-capacity network deployed in the cloud. When the edge device encounters challenging inputs, these inputs are transmitted to and processed in the cloud. Empirically, on the ImageNet classification dataset, our proposed method leads to a substantial decrease in the number of floating point operations (FLOPs) compared to a well-designed high-capacity network, with no loss in classification accuracy. A novel aspect of our method is that, by allowing abstentions on a small fraction of examples ($<20\%$), we can increase accuracy without a substantial increase in edge-device memory and FLOPs (up to $7$\% higher accuracy and $3$X fewer FLOPs on ImageNet at $80$\% coverage), relative to MobileNetV3 architectures.
One-sentence Summary: Large-scale prediction tasks are handled by high-capacity DNNs in the cloud; to handle the majority of the workload on low-capacity edge devices, we propose a hybrid approach that forwards only the few unusually hard examples to the high-capacity cloud model.
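To make the routing idea from the abstract concrete, below is a minimal, hypothetical sketch of edge-to-cloud routing at inference time. It assumes a simple confidence-threshold rule for deciding which inputs are "challenging"; the paper trains both networks in tandem and its actual routing/abstention mechanism may differ. The names `EdgeNet`, `CloudNet`, `route_batch`, and the `threshold` parameter are illustrative placeholders, not the authors' API.

```python
# Hypothetical sketch of hybrid edge/cloud inference via confidence thresholding.
# Assumptions: the edge model abstains when its softmax confidence is low, and
# those inputs are forwarded to the cloud model. Not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EdgeNet(nn.Module):
    """Stand-in for the low-capacity on-device model (e.g., a MobileNetV3-class network)."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):
        return self.backbone(x)


class CloudNet(nn.Module):
    """Stand-in for the high-capacity server-side model."""
    def __init__(self, num_classes: int = 1000):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.backbone(x)


@torch.no_grad()
def route_batch(x, edge, cloud, threshold: float = 0.7):
    """Run the edge model on every input; forward low-confidence ("challenging")
    inputs to the cloud model. Returns predictions and the coverage, i.e. the
    fraction of inputs resolved on-device without abstention."""
    edge_probs = F.softmax(edge(x), dim=1)
    conf, edge_pred = edge_probs.max(dim=1)
    keep = conf >= threshold                # inputs the edge model answers itself
    pred = edge_pred.clone()
    if (~keep).any():                       # abstained inputs go to the cloud
        pred[~keep] = cloud(x[~keep]).argmax(dim=1)
    coverage = keep.float().mean().item()
    return pred, coverage


if __name__ == "__main__":
    edge, cloud = EdgeNet().eval(), CloudNet().eval()
    images = torch.randn(8, 3, 224, 224)    # dummy ImageNet-sized batch
    preds, coverage = route_batch(images, edge, cloud)
    print(preds.shape, f"edge coverage: {coverage:.0%}")
```

In this sketch, lowering `threshold` raises coverage (more work stays on the edge device, fewer FLOPs overall) at the cost of accuracy on the hard examples that would otherwise have been escalated; the abstract's $80$\% coverage operating point corresponds to abstaining on roughly $20$\% of inputs.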