TwinDNN: A Tale of Two Deep Neural Networks

28 Sept 2020 (modified: 05 May 2023) · ICLR 2021 Conference Blind Submission
Keywords: Hardware Accelerator, High-Level Synthesis, Machine Learning, Neural Network Quantization
Abstract: Compression techniques for deep neural networks (DNNs), such as weight quantization, have been widely investigated to reduce model size so that DNNs can be deployed on hardware with strict resource constraints. However, one major downside of model compression is accuracy degradation. To address this problem effectively, we propose a new compressed-network inference scheme that couples a high-accuracy but slower DNN with its highly compressed version, which typically delivers much faster inference at lower accuracy. During inference, we estimate the confidence of the compressed DNN's prediction and run the original DNN only on the inputs for which the compressed DNN is not confident. The proposed design delivers overall accuracy close to that of the high-accuracy model, with latency closer to that of the compressed DNN. We demonstrate our design on two image classification tasks: CIFAR-10 and ImageNet. Our experiments show that our design can recover up to 94% of the accuracy drop caused by extreme network compression, with more than a 90% increase in throughput compared to using the original DNN alone. This is the first work that considers using a highly compressed DNN alongside the original DNN to improve latency significantly while effectively maintaining the original model accuracy.
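The following is a minimal PyTorch sketch of the confidence-gated cascade described in the abstract. The model handles (`compressed_model`, `original_model`), the use of top-1 softmax probability as the confidence measure, and the `threshold` value are illustrative assumptions for exposition, not the paper's exact implementation (which targets a hardware-accelerator/high-level-synthesis setting).

```python
import torch
import torch.nn.functional as F

def twin_inference(x, compressed_model, original_model, threshold=0.9):
    """Run the fast compressed DNN first; fall back to the original DNN
    for inputs whose prediction confidence is below `threshold`.

    `compressed_model`, `original_model`, and `threshold` are placeholders;
    the paper's actual confidence criterion and threshold selection may differ.
    """
    with torch.no_grad():
        # Fast, low-precision pass over the whole batch.
        logits = compressed_model(x)
        probs = F.softmax(logits, dim=-1)
        confidence, prediction = probs.max(dim=-1)  # top-1 probability as confidence

        # Re-run only the unconfident inputs through the slower, high-accuracy model.
        unsure = confidence < threshold
        if unsure.any():
            fallback_logits = original_model(x[unsure])
            prediction[unsure] = fallback_logits.argmax(dim=-1)
    return prediction
```

In such a scheme, the threshold trades accuracy against throughput: a higher threshold sends more inputs to the original DNN, recovering more of the compressed model's accuracy loss at the cost of latency.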
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: This study presents a way to accelerate neural network inference by using an extremely low-bit-width network, while maintaining the accuracy of the original network by running a relatively high-precision network concurrently.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=lltpc6fTif