- Keywords: Deep Learning, Model compression, Knowledge distillation, Pruning, Quantization, Computer vision
- TL;DR: Deep Learning Models for edge devices
- Abstract: Modern Deep Neural Network (DNN) models used in computer vision applications are compelling. They are widely used to solve a variety of problems, and growing data sizes mean that models can become very large and complex, with correspondingly higher computational requirements. The number of parameters in recent state-of-the-art networks makes them difficult to deploy on edge devices such as mobile phones, watches, or drones, where memory and energy are limited. We implement techniques that significantly reduce the size of a very large and powerful vision model while preserving as much of its performance as possible. We built classification models on the MNIST dataset and used models pre-trained on ImageNet on the Cats & Dogs dataset. We closely examined the effectiveness, in both mathematical and implementation terms, of knowledge distillation (KD), pruning, and quantization. First, we applied transfer learning, which consists of modifying the parameters of an already-trained network to adapt it to a new task on a new dataset. Second, we trained this network using a gradual pruning approach that requires minimal tuning and can be seamlessly incorporated into the training process. Third, quantization reduced the number of bits required to represent each parameter from 32-bit floats to 8 bits, which significantly reduced bandwidth and storage. On MNIST, we reduced the model from 12.52 MB to 0.57 MB with no loss of accuracy. After the transfer learning and pruning steps, we reduced MobileNet from 12.48 MB with an accuracy of 0.9556 to 2.91 MB with an accuracy of 0.9516.
We also show empirically that our method adapts to the classification architectures VGG16 and VGG19 on the Cats & Dogs dataset: the full pruning pipeline followed by 8-bit post-training quantization works well up to a sparsity level of 70%, incurs only very small losses in accuracy, and divides the size of the models obtained by transfer learning by 10.
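
The pruning and quantization steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the cubic sparsity ramp is only one common choice of gradual-pruning schedule and is assumed here, and the quantizer is a generic per-tensor affine mapping to unsigned 8-bit integers.

```python
import numpy as np

def sparsity_at_step(step, total_steps, final_sparsity=0.7, initial_sparsity=0.0):
    """Target sparsity at a given training step (assumed cubic ramp)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_prune(weights, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))
    if k == 0:
        return weights.copy(), np.ones_like(weights, dtype=bool)
    # Magnitude of the k-th smallest weight; everything at or below it is removed.
    threshold = np.partition(flat, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def quantize_uint8(weights):
    """Affine quantization of a float32 tensor to 8-bit unsigned integers."""
    # Force the range to contain 0 so that pruned weights stay exactly zero.
    w_min = min(float(weights.min()), 0.0)
    w_max = max(float(weights.max()), 0.0)
    scale = (w_max - w_min) / 255.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map 8-bit codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 128)).astype(np.float32)
    # Gradually raise the sparsity target, as would happen during training.
    for step in (0, 5, 10):
        w, mask = magnitude_prune(w, sparsity_at_step(step, total_steps=10))
    q, scale, zp = quantize_uint8(w)
    print("zero fraction:", float(np.mean(w == 0)))
    print("bytes float32 -> uint8:", w.nbytes, "->", q.nbytes)
```

In this sketch the 8-bit codes take a quarter of the float32 storage before any sparse encoding. Turning the pruned zeros into further savings would require a compressed or sparse storage format, which is not shown here.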