COMPRESSION AND ACCELERATION OF DEEP NEURAL NETWORKS: A VECTOR QUANTIZATION APPROACH

21 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: model compression, model acceleration, quantization
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: In the advancing field of deep learning, models continue to grow larger, with ever-increasing numbers of parameters. However, this progress carries a downside: it demands more powerful hardware, restricting the deployment of deep learning models, particularly on edge devices. Hence, compressing and accelerating deep learning models is vital to enabling their widespread deployment. The majority of recent studies propose compression or acceleration based on pruning, low-precision quantization, matrix factorization, or knowledge distillation. In this paper, we present a novel paradigm for compressing and accelerating deep learning models by harnessing vector quantization, a widely recognized method in data compression. Our technique applies vector quantization directly to the neural network weights. More precisely, a VQ-DNN model divides the weight parameters into equally sized segments, with the values of these segments exclusively derived from a compact codebook of values. During training, a VQ-DNN model learns both the codebook values and their mapping to the model weight parameters. Our work demonstrates that vector quantization leads to more efficient implementations of matrix multiplications and convolution operations, ultimately reducing the computational cost. This efficiency enables us to accelerate and compress a wide range of models, including both Convolutional Neural Networks (CNNs) and vision transformers. We present experimental results on the CIFAR-10, ImageNet, and EuroSat datasets using popular architectures such as VGG16, ResNet, and ViT models. In all scenarios, VQ-DNN reduces model size by over 95\%, surpassing state-of-the-art methods. Furthermore, it achieves comparable or superior reductions in Floating Point Operations (FLOPs) compared to existing methods, depending on the dataset and model configuration.
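To make the weight-sharing idea described in the abstract concrete, the following is a minimal illustrative sketch (not the authors' implementation) of a linear layer whose weight matrix is assembled from equally sized segments drawn from a small learned codebook. The class name `VQLinear`, the soft-assignment parameterization, and all hyperparameter names are assumptions introduced here for illustration only; the paper's actual training procedure and codebook/assignment learning may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VQLinear(nn.Module):
    """Illustrative (hypothetical) linear layer with vector-quantized weights.

    The flattened weight matrix is split into equally sized segments; each
    segment is associated with one codeword from a compact codebook, so the
    layer effectively stores only the codebook plus per-segment assignments
    rather than a full floating-point weight matrix.
    """

    def __init__(self, in_features, out_features, segment_size=4, codebook_size=256):
        super().__init__()
        assert (in_features * out_features) % segment_size == 0
        self.in_features = in_features
        self.out_features = out_features
        num_segments = (in_features * out_features) // segment_size
        # Compact codebook: `codebook_size` codewords of length `segment_size`.
        self.codebook = nn.Parameter(torch.randn(codebook_size, segment_size) * 0.02)
        # Learnable assignment logits: one distribution over codewords per segment
        # (a soft relaxation so both codebook and mapping can be learned end to end).
        self.assign_logits = nn.Parameter(torch.zeros(num_segments, codebook_size))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # Soft assignment during training; at inference this could be replaced by a
        # hard argmax lookup, so only integer indices and the codebook are stored.
        assign = F.softmax(self.assign_logits, dim=-1)   # (num_segments, codebook_size)
        segments = assign @ self.codebook                 # (num_segments, segment_size)
        weight = segments.reshape(self.out_features, self.in_features)
        return F.linear(x, weight, self.bias)
```

Under this sketch, the stored parameters are the codebook and one index (or assignment distribution) per segment, which is where the large reduction in model size would come from; the exact mechanism used in the paper should be taken from the full text rather than this example.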
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4105