Abstract: Thanks to their ability to capture long-range dependencies, Transformers achieve state-of-the-art performance in diverse research fields such as computer vision and audio processing. In practical scenarios, however, convolutional neural networks (CNNs) are used more often than Transformers because of their lower complexity. Transformer-to-CNN knowledge distillation (KD), where the Transformer is the teacher and the CNN is the student, is therefore in demand and receiving attention. In Transformer-to-CNN KD training, the capacity gap arising from structural differences between the teacher and student networks is the main cause of performance degradation of the student network, unlike in homogeneous-architecture KD. However, previous KD studies transfer all of a teacher's knowledge to the student without considering structural differences; they cannot overcome the problems caused by these differences and show poor performance in Transformer-to-CNN KD. In this paper, we identify general and specific knowledge in the feature maps of the teacher and student, where general and specific knowledge are the generalized and non-generalized feature representations, respectively. We propose a novel KD framework, DropKD, which extracts general knowledge from the teacher and student while removing specific knowledge, and then lets the general knowledge of the student network learn from the general knowledge of the teacher. DropKD empowers the student network to generalize by effectively managing general and specific knowledge. Through extensive experiments on challenging image classification datasets, we demonstrate that the proposed method is superior to existing methods.
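The abstract does not specify how DropKD separates general from specific knowledge, so the following is only a minimal illustrative sketch of the idea of aligning "general" feature components between teacher and student while dropping "specific" ones. It assumes a simple channel-selection scheme based on teacher activation energy; the function name `dropkd_loss`, the `keep_ratio` parameter, and the selection rule are all hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F


def dropkd_loss(student_feat: torch.Tensor,
                teacher_feat: torch.Tensor,
                keep_ratio: float = 0.5) -> torch.Tensor:
    """Sketch of a feature-distillation loss that keeps only 'general' channels.

    Assumes student_feat has already been projected (e.g. by a learned 1x1 conv)
    to the teacher's channel count and spatial size, since a Transformer teacher
    and CNN student generally have different feature shapes.
    """
    # Rank teacher channels by average activation magnitude; treat the
    # lowest-energy channels as 'specific' knowledge and drop them.
    energy = teacher_feat.abs().mean(dim=(0, 2, 3))        # per-channel energy, shape (C,)
    k = max(1, int(keep_ratio * teacher_feat.size(1)))
    keep_idx = energy.topk(k).indices                       # indices of 'general' channels

    # Align the student's retained channels with the teacher's (teacher frozen).
    return F.mse_loss(student_feat[:, keep_idx],
                      teacher_feat[:, keep_idx].detach())


# Usage sketch: add the distillation term to the ordinary task loss.
# total_loss = task_loss + lambda_kd * dropkd_loss(s_feat, t_feat, keep_ratio=0.5)
```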
External IDs: dblp:conf/wacv/LeeHSKK25