Co-training and Co-distillation for Quality Improvement and Compression of Language Models

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission
Abstract: The increasing size and computational cost of pre-trained language models (PLMs) make them difficult to deploy for downstream tasks in resource-constrained environments. Conventional knowledge distillation (KD) can compress PLMs by transferring knowledge from a larger teacher model to a smaller student model, but it typically trades performance for inference efficiency. In this work, we propose a novel Co-Training and Co-Distillation (CTCD) framework that improves the quality of PLMs while increasing their inference efficiency. Our approach trains models of different sizes together and lets them distill knowledge from each other. We demonstrate that the proposed co-distillation improves the quality of the teacher model, which in turn improves the quality of the student model. Furthermore, we introduce Community KD, a setting with a single teacher and two students in which each student learns from both the teacher and the other student. Our experiments show that Community KD is effective at compressing PLMs while improving their quality, outperforming the conventional one-way KD method by 1.2% on the GLUE benchmark.
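
The abstract only describes the two-way distillation at a high level, so the following is a minimal sketch of what such a co-distillation objective could look like, assuming the usual softened-softmax KD formulation. The temperature, loss weight `alpha`, toy model sizes, and function name `co_distillation_losses` are illustrative assumptions, not details taken from the paper.

```python
# Sketch of two-way (co-)distillation: a larger and a smaller model are trained
# jointly, and each one distills from the other's softened predictions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def co_distillation_losses(large_logits, small_logits, labels, temperature=2.0, alpha=0.5):
    """Return (large_loss, small_loss): each model combines its own cross-entropy
    with a KL term that distills from the other model's (detached) predictions."""
    ce_large = F.cross_entropy(large_logits, labels)
    ce_small = F.cross_entropy(small_logits, labels)

    # Soft targets come from the *other* model; detach so each KL term only
    # updates the model being trained by that term.
    kl_large_from_small = F.kl_div(
        F.log_softmax(large_logits / temperature, dim=-1),
        F.softmax(small_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    kl_small_from_large = F.kl_div(
        F.log_softmax(small_logits / temperature, dim=-1),
        F.softmax(large_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    large_loss = (1 - alpha) * ce_large + alpha * kl_large_from_small
    small_loss = (1 - alpha) * ce_small + alpha * kl_small_from_large
    return large_loss, small_loss

# Toy usage: two classifiers of different capacity trained jointly on one batch.
large_model = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 4))
small_model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 4))
opt = torch.optim.Adam(
    list(large_model.parameters()) + list(small_model.parameters()), lr=1e-3
)

x, y = torch.randn(8, 32), torch.randint(0, 4, (8,))
l_loss, s_loss = co_distillation_losses(large_model(x), small_model(x), y)
opt.zero_grad()
(l_loss + s_loss).backward()
opt.step()
```

The Community KD variant described above could be sketched analogously by adding a second student and a fixed teacher, with each student's loss mixing cross-entropy, distillation from the teacher, and distillation from the other student; the exact weighting is not specified in the abstract.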
Paper Type: long
Research Area: Efficient Methods for NLP