Abstract: Pruning is an efficient model compression technique that removes redundant connections in deep neural networks (DNNs). A critical problem in representing the sparse matrices produced by pruning is that as quantization uses fewer bits and the pruning rate increases, the index data accounts for a relatively larger share of the model size. Moreover, an irregular index format leads to low parallelism for convolutions and matrix multiplications. In this paper, we propose a new network pruning technique that generates a low-rank binary index matrix to compress index data significantly. Specifically, the proposed compression method finds a fine-grained pruning mask that can be decomposed into two binary matrices, so that the index data can be decompressed by a simple binary matrix multiplication. We also propose a tile-based factorization technique that not only lowers memory requirements but also enhances the compression ratio. Various DNN models (including conv layers and LSTM layers) can be pruned with far fewer index bits than previous sparse matrix formats while maintaining the same pruning rate.
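As a rough illustration (not the authors' implementation), the decompression step described above amounts to a Boolean matrix product: a binary mask of size m x n is reconstructed from two binary factors of sizes m x r and r x n, so the index storage drops from m*n bits to r*(m + n) bits. The sketch below assumes hypothetical sizes m, n, r and random factors purely for demonstration.

```python
import numpy as np

# Minimal sketch: rebuild a binary pruning mask from two low-rank
# binary factors via Boolean matrix multiplication, then apply it
# to a dense weight matrix. Sizes and factors are hypothetical.

rng = np.random.default_rng(0)
m, n, r = 64, 64, 4  # hypothetical layer dimensions and rank

A = rng.integers(0, 2, size=(m, r)).astype(np.uint8)  # binary factor (m x r)
B = rng.integers(0, 2, size=(r, n)).astype(np.uint8)  # binary factor (r x n)

# Boolean matrix product: mask[i, j] = OR_k (A[i, k] AND B[k, j]).
mask = (A @ B) > 0

W = rng.standard_normal((m, n)).astype(np.float32)
W_pruned = np.where(mask, W, 0.0)  # zero out pruned weights

# Index storage: m*n bits for a dense mask vs. r*(m + n) bits for the factors.
print(f"dense mask bits: {m * n}, factor bits: {r * (m + n)}")
```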
Keywords: Pruning, Model compression, Index compression, Low-rank, Binary matrix decomposition
TL;DR: We propose a new pruning technique to generate a low-rank binary index matrix.