Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision

Published: 12 Oct 2024 · Last Modified: 12 Oct 2024 · Accepted by TMLR · License: CC BY 4.0
Abstract: Despite the outstanding performance of transformers in both language and vision tasks, their growing computation and model size have increased the demand for efficient deployment. To address the heavy computation and parameter overhead, quantization is frequently studied as a representative model compression technique and has been used extensively on ConvNets. However, due to the unique properties of transformers, low-bit quantization of these models remains limited and underexplored. In this paper, we attribute the difficulty of low-bit quantization-aware training (QAT) for transformers to their distinctive variation behaviors, which differ significantly from those of ConvNets. Based on a comprehensive quantitative analysis, we observe variation at three levels: differing quantization sensitivities across modules, outliers in the static weight and activation distributions, and oscillation in the dynamic parameter updates. These variations destabilize QAT and degrade performance. We explore best practices to alleviate the influence of variation during low-bit transformer QAT and propose a variation-aware quantization scheme for both vision and language transformers. Extensive experiments show that our scheme alleviates the variation and improves transformer performance across various models and tasks. Our solution substantially improves 2-bit Swin-T and binary BERT-base, achieving 3.35% and 1.4% accuracy improvements over previous state-of-the-art methods on ImageNet-1K and GLUE, respectively.
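As a minimal illustration of the quantization-aware training setting the abstract refers to, the sketch below shows a generic uniform fake-quantizer with a straight-through estimator applied to a linear layer's weights in PyTorch. The layer, bit-width, and per-tensor scaling rule are illustrative assumptions for a plain QAT baseline, not the paper's variation-aware quantization scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantSTE(torch.autograd.Function):
    """Uniform fake quantization with a straight-through estimator (STE)."""

    @staticmethod
    def forward(ctx, x, scale, num_bits):
        qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
        ctx.save_for_backward(x, scale)
        ctx.qmin, ctx.qmax = qmin, qmax
        # Quantize, clamp to the representable range, then de-quantize.
        return torch.clamp(torch.round(x / scale), qmin, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        x, scale = ctx.saved_tensors
        # STE: pass the gradient through only where the input was not clipped.
        inside = ((x / scale) >= ctx.qmin) & ((x / scale) <= ctx.qmax)
        return grad_out * inside.to(grad_out.dtype), None, None


class QuantLinear(nn.Linear):
    """Linear layer whose weights are fake-quantized on the fly during QAT."""

    def __init__(self, in_features, out_features, num_bits=2, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.num_bits = num_bits

    def forward(self, x):
        # Per-tensor scale from the current weight range (a common QAT baseline).
        scale = self.weight.detach().abs().max() / (2 ** (self.num_bits - 1) - 1)
        w_q = FakeQuantSTE.apply(self.weight, scale, self.num_bits)
        return F.linear(x, w_q, self.bias)
```

In a QAT pipeline, such quantized layers replace their full-precision counterparts inside the transformer blocks, and the network is trained (or fine-tuned) end-to-end so the weights adapt to the quantization noise.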
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=yVytaiJzkq
Changes Since Last Submission: We have fully revised our manuscript according to the reviews and made the following changes since the last submission:
- We rewrote the section on knowledge distillation to emphasize how our method utilizes multi-crop KD and to highlight our contribution in revealing the effectiveness of multi-crop KD in reducing variation when quantizing transformer-based models (a hedged sketch follows this list).
- We extended the method to general transformers, including vision transformers (DeiT, SReT, Swin Transformer) and language transformers (BERT), and added experiments on larger-scale ViT models.
- We incorporated all additional results from the previous rebuttal into the updated version and fully polished the presentation, correcting typos and grammatical errors to avoid confusion.
- We further discuss the possibility of generalizing to CNNs and explain why our method is transformer-specific (Appendix F).
- We provide more hardware efficiency experiments, including the hardware utilization comparison with mixed-precision counterparts in Table 8, and memory consumption and training time per epoch in Table 13. Notably, we implemented the MAC units in Verilog HDL and compared them in terms of area and power dissipation.
- We include a discussion of recent research in Appendix G, especially work on outliers and quantization sensitivity in Large Language Models (LLMs).
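For readers unfamiliar with the multi-crop knowledge distillation mentioned in the first item, the sketch below shows one common way to average per-crop KL-divergence losses between a full-precision teacher and a quantized student. The crop generation, temperature, and loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F


def multi_crop_kd_loss(student, teacher, crops, temperature=2.0):
    """Average the distillation loss over several crops of the same batch.

    `crops` is a list of image tensors, each a different random crop/resize of
    the same underlying images. The full-precision teacher provides soft
    targets; the (quantized) student is trained to match them on every crop.
    """
    teacher.eval()
    total = 0.0
    for crop in crops:
        with torch.no_grad():
            t_logits = teacher(crop)
        s_logits = student(crop)
        # Soft-target KL divergence, scaled by T^2 as in standard distillation.
        kd = F.kl_div(
            F.log_softmax(s_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        total = total + kd
    return total / len(crops)
```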
Code: https://github.com/HuangOwen/Quantization-Variation
Supplementary Material: zip
Assigned Action Editor: ~Naigang_Wang1
Submission Number: 3006