Reducing the Teacher-Student Gap via Elastic Student

Published: 01 Jan 2023 · Last Modified: 18 Jun 2024 · KSEM (1) 2023 · CC BY-SA 4.0
Abstract: Knowledge distillation (KD) has shown promise in transferring knowledge from a larger teacher model to a smaller student model. Nevertheless, a prevalent phenomenon in knowledge distillation is that student performance degrades when the teacher-student gap becomes large. We contend that this degradation is predominantly attributable to two gaps: the capacity gap and the knowledge gap. In this paper, we introduce Elastic Student Knowledge Distillation (ESKD), a method comprising an Elastic Architecture and an Elastic Learning strategy to bridge these two gaps. The Elastic Architecture temporarily increases the number of the student's parameters during training and reverts to the original size at inference, improving the model's learning ability without increasing inference-time cost. The Elastic Learning strategy introduces a mask matrix and a progressive learning schedule that help the student absorb the teacher's intricate knowledge and provide a regularizing effect. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that ESKD outperforms existing methods while preserving computational efficiency.
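The abstract does not spell out how the temporarily enlarged student is reverted at inference. One plausible reading is that the expand-then-revert step works like structural re-parameterization, sketched below in PyTorch under that assumption; the class name ElasticConvBlock, the choice of parallel 3x3 convolution branches, and the fusion rule are illustrative assumptions, not details taken from the paper. During training the block uses several parallel branches (more parameters), and before inference they are folded into a single convolution of the student's original size, so inference cost does not change.

import torch
import torch.nn as nn

class ElasticConvBlock(nn.Module):
    """Hypothetical sketch: extra parallel branches during training,
    folded into one convolution of the original size for inference."""

    def __init__(self, channels: int, num_branches: int = 3):
        super().__init__()
        # Several parallel 3x3 convolutions act as the "expanded" student.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            for _ in range(num_branches)
        )
        self.merged = None  # populated by fuse() before inference

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.merged is not None:
            return self.merged(x)                           # inference: single conv
        return sum(branch(x) for branch in self.branches)   # training: all branches

    @torch.no_grad()
    def fuse(self) -> None:
        """Collapse the parallel branches into one conv of the original size.
        Summing convolutions of identical shape equals a single convolution
        whose weight and bias are the element-wise sums."""
        fused = nn.Conv2d(self.branches[0].in_channels,
                          self.branches[0].out_channels,
                          kernel_size=3, padding=1)
        fused.weight.copy_(sum(b.weight for b in self.branches))
        fused.bias.copy_(sum(b.bias for b in self.branches))
        self.merged = fused

# Usage: train with the expanded block, then fuse before deployment.
block = ElasticConvBlock(channels=16)
x = torch.randn(2, 16, 32, 32)
y_train = block(x)   # uses all branches
block.fuse()
y_infer = block(x)   # uses the single merged conv
print(torch.allclose(y_train, y_infer, atol=1e-5))  # True: outputs match

Because the branches share kernel size and padding, their sum is exactly one convolution with summed weights and biases, which is why the fused block reproduces the training-time output while keeping the deployed student at its original parameter count.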