- TL;DR: We present an end-to-end design methodology for efficient deep learning deployment.
- Abstract: We present an end-to-end design methodology for efficient deep learning deployment. Unlike previous methods that separately optimize the neural network architecture, pruning policy, and quantization policy, we jointly optimize them in an end-to-end manner. To deal with the larger design space it brings, we train a quantization-aware accuracy predictor that fed to the evolutionary search to select the best fit. We first generate a large dataset of <NN architecture, ImageNet accuracy> pairs without training each architecture, but by sampling a unified supernet. Then we use these data to train an accuracy predictor without quantization, further using predictor-transfer technique to get the quantization-aware predictor, which reduces the amount of post-quantization fine-tuning time. Extensive experiments on ImageNet show the benefits of the end-to-end methodology: it maintains the same accuracy (75.1%) as ResNet34 float model while saving 2.2× BitOps comparing with the 8-bit model; we obtain the same level accuracy as MobileNetV2+HAQ while achieving 2×/1.3× latency/energy saving; the end-to-end optimization outperforms separate optimizations using ProxylessNAS+AMC+HAQ by 2.3% accuracy while reducing orders of magnitude GPU hours and CO2 emission.
- Keywords: End-to-end Design, Joint Optimization, Architecture Search, Network Pruning, Network Quanzation