Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers

22 Sept 2022 (modified: 13 Feb 2023) · ICLR 2023 Conference Withdrawn Submission · Readers: Everyone
Keywords: Deep Learning, Quantization, QAT, Self-Attention, Transformer, BERT
TL;DR: An efficient and accurate two-step quantization-aware training method for fine-tuned Transformers
Abstract: Pre-trained Transformer models such as BERT have shown great success in a wide range of applications, but at the cost of a substantial increase in model complexity. Quantization-aware training (QAT) is a promising way to lower the implementation cost and energy consumption. However, aggressive quantization below 2 bits causes considerable accuracy degradation, especially when the downstream dataset is not abundant. This work proposes a proactive knowledge distillation method called Teacher Intervention (TI) for fast-converging QAT of ultra-low precision pre-trained Transformers. TI intervenes in layer-wise signal propagation, substituting the intact signal from the teacher to remove the interference of propagated quantization errors, which smooths the loss surface and expedites convergence. We further propose a gradual intervention mechanism that stabilizes the tuning of the feed-forward network and recovers the self-attention map in steps. The proposed scheme enables fast convergence of QAT and improves model accuracy across downstream fine-tuning tasks with diverse characteristics. We demonstrate that TI consistently achieves superior accuracy with a lower fine-tuning budget.
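To make the core idea concrete, the sketch below is a minimal, hypothetical illustration (not the authors' implementation) of layer-wise distillation with teacher intervention: each quantized student sub-layer receives the teacher's intact hidden state as input, so quantization error from earlier layers does not propagate into the tuning of later layers. The module names, toy Linear sub-blocks, and loss function are assumptions for illustration; in the paper the sub-blocks would be the self-attention and feed-forward layers of a fine-tuned BERT, with the student quantized to ultra-low precision.

```python
import torch
import torch.nn as nn

# Toy stand-ins for Transformer sub-layers; in practice these would be
# BERT self-attention / feed-forward blocks, with the student quantized.
teacher = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)]).eval()
student = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])

mse = nn.MSELoss()

def teacher_intervention_loss(x):
    """Layer-wise distillation with teacher intervention: every student
    layer is fed the teacher's clean hidden state, so accumulated
    quantization error never interferes with the current layer's update."""
    loss = 0.0
    h_teacher = x
    for t_layer, s_layer in zip(teacher, student):
        with torch.no_grad():
            h_next = t_layer(h_teacher)   # intact teacher signal
        s_out = s_layer(h_teacher)        # intervention: student sees teacher input
        loss = loss + mse(s_out, h_next)  # match the teacher layer's output
        h_teacher = h_next                # propagate only the teacher signal
    return loss

x = torch.randn(8, 16)
teacher_intervention_loss(x).backward()
```

Under this reading, the gradual intervention mechanism would correspond to progressively relaxing the substitution (first for the feed-forward sub-layers, then for the self-attention maps) so the student eventually trains on its own propagated signals.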
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning