Abstract: Training a trustworthy Transformer model on a small image classification dataset is difficult. This research proposes a structured knowledge distillation algorithm that uses CNNs as teachers for the Transformer, significantly reducing the amount of training data required. To better exploit the potential of the CNN teachers, this research designates the public dataset on which the CNNs were trained as an introductory textbook that guides the Transformer's early training and keeps it from falling into a local optimum prematurely. The distillation process then employs a "learn-digest-self-distillation" strategy that enables the Transformer to assimilate the CNN knowledge in a structured manner. Extensive experiments show that the proposed method significantly outperforms directly training the Transformer when training data are limited. Moreover, to demonstrate its practical value, this research contributes a real-world dataset for the classification of smoking and phone-calling behavior. The corresponding code and dataset will be released at https://gitee.com/wustdch/surpass-teacher if this paper is accepted.
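The abstract describes the CNN-to-Transformer distillation only at a high level. For illustration, the sketch below shows a conventional logit-distillation loss in PyTorch; the function name, temperature, and loss weighting are assumptions for exposition and are not the authors' "learn-digest-self-distillation" implementation.

```python
# Minimal sketch: soft-label distillation from a CNN teacher to a Transformer
# student (illustrative only; hyperparameters are assumed, not from the paper).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence to the teacher's soft labels."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # scale by T^2 to keep gradient magnitudes comparable
    return alpha * hard + (1.0 - alpha) * soft
```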