- Abstract: The large number of weights in deep neural networks makes the models difficult to deploy in low-memory environments such as mobile phones and IoT edge devices, as well as in "inferencing as a service" environments on the cloud. Prior work has reduced model size through compression techniques such as weight pruning and filter pruning, or through low-rank decomposition of the convolution layers. In this paper, we demonstrate that combining multiple techniques achieves not only higher model compression but also a reduction in the compute resources required during inference. We perform filter pruning followed by low-rank decomposition, using Tucker decomposition, for model compression. We show that our approach achieves up to 57\% higher model compression than either Tucker decomposition or filter pruning alone at similar accuracy for GoogLeNet. It also reduces FLOPs by up to 48\%, making inference faster.
- TL;DR: Combining orthogonal model compression techniques to significantly reduce model size and the number of FLOPs required during inference.
- Keywords: Deep Learning, Model compression, Filter Pruning, Low-rank Decomposition, Tucker Decomposition
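
The pipeline named in the abstract (filter pruning followed by Tucker decomposition) can be illustrated on a single convolution kernel. The following is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the L1-norm ranking criterion, the truncated-HOSVD form of Tucker-2, the function names, and the layer shape, keep ratio, and ranks are all hypothetical choices made for illustration. No fine-tuning or accuracy evaluation is included.

```python
import numpy as np

def prune_filters(weight, keep_ratio):
    """Keep the output filters with the largest L1 norm (assumed criterion)."""
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    norms = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return weight[keep], keep

def tucker2(weight, rank_out, rank_in):
    """Tucker-2 via truncated HOSVD over the output- and input-channel modes."""
    c_out, c_in, _, _ = weight.shape
    unfold_out = weight.reshape(c_out, -1)                    # mode-0 unfolding
    unfold_in = np.moveaxis(weight, 1, 0).reshape(c_in, -1)   # mode-1 unfolding
    u_out = np.linalg.svd(unfold_out, full_matrices=False)[0][:, :rank_out]
    u_in = np.linalg.svd(unfold_in, full_matrices=False)[0][:, :rank_in]
    # Project the kernel onto the two factor bases to obtain the core tensor.
    core = np.einsum("oihw,or,is->rshw", weight, u_out, u_in)
    return core, u_out, u_in

def reconstruct(core, u_out, u_in):
    """Rebuild the (approximate) kernel from the Tucker-2 factors."""
    return np.einsum("rshw,or,is->oihw", core, u_out, u_in)

def tucker2_param_count(core, u_out, u_in):
    """Weights in the 1x1 -> kxk -> 1x1 convolution sequence that replaces the layer."""
    return core.size + u_out.size + u_in.size

# Hypothetical layer: 64 output channels, 32 input channels, 3x3 kernel.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 32, 3, 3))

pruned, kept = prune_filters(w, keep_ratio=0.5)               # 32 filters remain
core, u_out, u_in = tucker2(pruned, rank_out=16, rank_in=16)

original_params = w.size
compressed_params = tucker2_param_count(core, u_out, u_in)
```

In this sketch, pruning shrinks the output-channel dimension before decomposition, so the Tucker factors are computed on an already smaller tensor; this is one way to see why the two techniques can compound. In a full network, removing output filters from one layer would also require removing the matching input channels of the next layer, which the sketch omits.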