Improving Efficiency of Neural Image Classification and Object Detection Systems using Automated Layer Caching

TMLR Paper 1863 Authors

23 Nov 2023 (modified: 22 Mar 2024) · Rejected by TMLR
Abstract: Deep Neural Networks (DNNs) have become an essential component in many application domains, including web-based services. Many of these services require high throughput and (close to) real-time responses, for instance, to react to users' requests or to process a stream of incoming data on time. However, the trend in DNN design is toward larger models with many layers and parameters, which achieve more accurate results. Although these models are often pre-trained, their computational complexity can still be significant enough to hinder low inference latency. In this paper, we propose an end-to-end automated caching solution that improves the performance of DNN-based services in terms of computational complexity and inference latency. Our method combines the ideas of self-distillation of DNN models and early exits. The proposed solution is an automated online layer caching mechanism that allows a large model to exit early at inference time if the cache model at one of the early exits is confident enough to make the final prediction. A key contribution of this paper is that the caching is fully online: the cache models do not need access to training data and operate solely on the data arriving at run time, which makes the approach suitable for applications that use pre-trained models. Our experimental results on two downstream tasks (image classification and object detection) show that, on average, caching can reduce the computational complexity of these services by up to 58% (in terms of FLOPs count) and improve their inference latency by up to 46%, with low to zero loss in accuracy. Our approach also outperforms existing approaches, particularly when applied to complex models and larger datasets: on CIFAR100-ResNet50, it reduces latency by 51.6% and 30.4% relative to the Gati and BranchyNet methods, respectively, while improving mean accuracy by 2.92% and 0.87%, further highlighting the superiority of our approach in demanding scenarios.
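To make the mechanism concrete, the following is a minimal PyTorch sketch of online early-exit caching via self-distillation, as described in the abstract. It is not the authors' implementation: the class and parameter names (CacheHead, CachedClassifier, the confidence threshold tau, the front/back split of the backbone) are illustrative assumptions.

```python
# Minimal sketch of online early-exit caching via self-distillation.
# Assumptions (not from the paper's code): a frozen pre-trained backbone
# split into `front` and `back` stages; CacheHead and tau are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CacheHead(nn.Module):
    """Small exit branch attached to an intermediate feature map."""

    def __init__(self, in_channels: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(in_channels, num_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(feats).flatten(1))


class CachedClassifier(nn.Module):
    """Wraps a frozen backbone; exits early when the cache head is confident."""

    def __init__(self, front: nn.Module, back: nn.Module,
                 cache_head: CacheHead, tau: float = 0.9):
        super().__init__()
        self.front, self.back = front, back
        self.cache_head, self.tau = cache_head, tau
        # Only the cache head is updated online; the backbone stays frozen.
        self.opt = torch.optim.Adam(self.cache_head.parameters(), lr=1e-3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.front(x)
            cache_logits = self.cache_head(feats)
            # Single-input confidence check; batched serving would need
            # per-sample routing instead of a scalar max.
            if F.softmax(cache_logits, dim=1).max().item() >= self.tau:
                return cache_logits  # early exit: skip the remaining layers
            full_logits = self.back(feats)  # fall through to the full model
        # Online self-distillation on a cache miss: fit the cache head to the
        # full model's output on this very input -- no training data needed.
        with torch.enable_grad():
            loss = F.kl_div(
                F.log_softmax(self.cache_head(feats), dim=1),
                F.softmax(full_logits, dim=1),
                reduction="batchmean")
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()
        return full_logits
```

In practice, the front/back split could be, for example, the first stages versus the remainder of a torchvision ResNet-50, with one cache head per candidate exit point; the design keeps all backbone computation under no_grad so that only the lightweight cache head incurs training cost at run time.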
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We are currently on Revision R1.1, which has addressed some of the identified issues. Based on the comments, we plan to complete 2 more revisions. Please refer to the PDF file for details.
Assigned Action Editor: ~Evan_G_Shelhamer1
Submission Number: 1863