Improving Weight-Inherited Distillation with Data-aware Initialization and Structural Adaptation

Anonymous

16 Oct 2023 · ACL ARR 2023 October Blind Submission · Readers: Everyone
Abstract: Weight-Inherited Distillation (WID) is an effective distillation method that inherits weights from the teacher model, achieving better results than traditional distillation methods. However, the identity matrix initialization used in WID leads to slow convergence. In this work, we propose an improved WID method named \smodel that replaces the identity matrix initialization with a specialized data-aware initialization. We also improve the structural design of WID, making it more flexible and adaptive in choosing the compressed model's structure. Experiments on the GLUE and SQuAD datasets show that the model delivered by \smodel retains 96\% of the performance with 94\% of the parameters removed, demonstrating its effectiveness compared to previous pruning and distillation methods.
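Since only the abstract is available here, the following is a minimal, hypothetical sketch of what a data-aware initialization of a weight-inheritance mapping could look like, as opposed to a truncated identity matrix. The function name `data_aware_init`, the SVD-over-activations choice, and all shapes are illustrative assumptions, not the authors' actual method.

```python
import torch

def data_aware_init(teacher_weight, activations, student_dim):
    """Hypothetical sketch: build a compression mapping C (student_dim x teacher_dim)
    from calibration-data statistics instead of a truncated identity matrix.
    The student weight is then inherited as C @ teacher_weight."""
    # Center activations collected from a small calibration batch (assumed available).
    acts = activations - activations.mean(dim=0, keepdim=True)
    # Principal directions of the teacher's activations; torch.linalg.svd with
    # full_matrices=False returns Vh of shape (min(n_samples, teacher_dim), teacher_dim).
    _, _, vh = torch.linalg.svd(acts, full_matrices=False)
    # Keep the top `student_dim` directions as the compression mapping.
    compress = vh[:student_dim]                 # (student_dim, teacher_dim)
    student_weight = compress @ teacher_weight  # inherited student weight
    return compress, student_weight

# Usage with random stand-ins for a teacher layer and calibration activations.
teacher_dim, out_dim, student_dim = 768, 768, 192
teacher_W = torch.randn(teacher_dim, out_dim)
calib_acts = torch.randn(256, teacher_dim)      # activations from a few batches
C, student_W = data_aware_init(teacher_W, calib_acts, student_dim)
print(C.shape, student_W.shape)                 # (192, 768), (192, 768)
```

The intent of such an initialization is that the inherited student weights already respect directions that matter for the data, which is one plausible way a data-aware scheme could speed up convergence relative to an identity start.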
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings - efficiency
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.