Abstract: Pre-trained language models face a deployment bottleneck in production due to their high computational cost. Model compression methods have emerged as critical technologies for overcoming this bottleneck. As a popular compression method, knowledge distillation transfers knowledge from a large (teacher) model to a small (student) one. However, existing methods perform distillation on the entire dataset, which easily leads to redundant learning for the student. Furthermore, the capacity gap between the teacher and the student hinders knowledge transfer. To address these issues, we propose Data-efficient Knowledge Distillation (DeKD) with teacher-assistant-based dynamic objective alignment, which enables the student to dynamically adjust its learning process. Specifically, we first design an entropy-based strategy to select informative instances at the data level, which reduces learning from instances the student has already mastered. Next, we introduce a teacher assistant as an auxiliary model for the student at the model level to mitigate the degradation of distillation performance. Finally, we develop a mechanism that dynamically aligns the intermediate representations of the teacher to ensure effective knowledge transfer at the objective level. Extensive experiments on benchmark datasets show that our method outperforms state-of-the-art methods.
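The entropy-based instance selection step can be illustrated with a minimal sketch. The abstract does not specify the exact selection criterion, so the use of the student's prediction entropy and the `threshold` parameter below are assumptions for illustration only, not the paper's definitive procedure.

```python
import torch
import torch.nn.functional as F

def select_informative_instances(student_logits: torch.Tensor,
                                 threshold: float) -> torch.Tensor:
    """Return a boolean mask over the batch keeping instances whose
    student prediction entropy exceeds `threshold`, i.e. instances the
    student has presumably not yet mastered (hypothetical criterion)."""
    probs = F.softmax(student_logits, dim=-1)                    # (batch, num_classes)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1)    # (batch,)
    return entropy > threshold

# Usage: filter a batch before computing the distillation loss.
logits = torch.randn(8, 3)                      # dummy student logits: 8 instances, 3 classes
mask = select_informative_instances(logits, threshold=0.9)
selected = logits[mask]                         # only these instances contribute to distillation
```

Under this assumption, confidently predicted (low-entropy) instances are dropped from the distillation objective, which matches the stated goal of reducing repetitive learning on mastered data.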