Abstract: Deep Neural Network (DNN) models have brought significant performance gains to machine learning tasks. Nevertheless, the huge storage and computational costs of high-performance models severely limit their deployment on resource-limited embedded devices. Knowledge distillation (KD) is the mainstream approach to model compression, but the existing KD framework has some limitations. First, the student model underperforms when there is a considerable capacity gap between the student and teacher models, which reduces the achievable compression rate in knowledge distillation. Second, compared with training from scratch, the billions of additional forward passes through the complex teacher model are time-consuming, resulting in longer training time during distillation. This paper proposes AKD, an efficient difference-adaptive knowledge distillation framework. AKD consists of two modules: the distillation module uses a modified auxiliary teaching architecture as its backbone, and the adaptive module introduces a difference-adaptive method that adjusts the teacher model according to the continuously improving representation ability of the student model. Experimental results on a variety of datasets and models indicate that AKD not only improves the efficiency of distillation but also enhances the performance of the compact student model and raises the compression rate.
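For context, the sketch below illustrates the conventional knowledge-distillation objective that frameworks such as AKD build upon: a weighted combination of a softened teacher-student KL term and the hard-label cross-entropy. It is not the paper's AKD method; the temperature T, weight alpha, and the toy teacher/student networks are illustrative assumptions. Note the extra teacher forward pass under torch.no_grad(), which is the source of the distillation-time overhead the abstract highlights.

```python
# Minimal sketch of the standard (Hinton-style) KD loss, assuming PyTorch.
# T, alpha, and the toy networks are illustrative choices, not AKD settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Weighted sum of soft-target KL divergence and hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so soft-target gradients match the hard-label scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: a larger "teacher" and a compact "student" on random data.
teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
with torch.no_grad():          # the extra teacher forward pass that adds
    t_logits = teacher(x)      # training cost during distillation
loss = kd_loss(student(x), t_logits, y)
loss.backward()
```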