Cross-Architecture Knowledge Distillation

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · Int. J. Comput. Vis. 2024 · License: CC BY-SA 4.0
Abstract: The Transformer network architecture has gained attention due to its ability to learn global relations and its superior performance. To boost performance, it is natural to distill complementary knowledge from a Transformer network to a convolutional neural network (CNN). However, most existing knowledge distillation methods only consider homologous-architecture distillation, which may not be suitable for cross-architecture scenarios such as Transformer to CNN. To address this problem, we analyze the globality and transferability of models, which reflect the ability to capture global knowledge and to transfer knowledge from teacher to student, respectively. Inspired by our observations, a novel cross-architecture knowledge distillation method is proposed, which supports bi-directional distillation, both from Transformer to CNN and from CNN to Transformer. Specifically, rather than directly mimicking the output and intermediate features of the teacher, a partial cross-attention projector (PCA/iPCA) and a group-wise linear projector (GL/iGL) are introduced to align the student features with the teacher's in two projected feature spaces. To better match the teacher's knowledge with the student's, an adaptive distillation router (ADR) is presented to decide which teacher layer's knowledge should be distilled to guide which student layer. A multi-view robust training scheme is further presented to improve the robustness of the distillation framework. Extensive experiments show that the proposed method outperforms 17 state-of-the-art methods on both small-scale and large-scale datasets.
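To make the cross-architecture idea concrete, the sketch below (PyTorch) shows one possible way to project CNN student feature maps into a Transformer teacher's token space with cross-attention and match them with a feature-distillation loss. This is not the paper's PCA/GL/ADR implementation: the module name CrossAttentionProjector, the use of teacher tokens as queries, the head count, and the MSE matching loss are illustrative assumptions based only on the abstract.

```python
# Minimal sketch of cross-attention feature projection for Transformer-to-CNN
# distillation. All design details here (query/key roles, loss, shapes) are
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionProjector(nn.Module):
    """Projects CNN feature maps (B, C, H, W) toward Transformer teacher
    tokens (B, N, D) using multi-head cross-attention."""

    def __init__(self, student_channels: int, teacher_dim: int, num_heads: int = 8):
        super().__init__()
        self.embed = nn.Conv2d(student_channels, teacher_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(teacher_dim)

    def forward(self, student_feat: torch.Tensor, teacher_tokens: torch.Tensor) -> torch.Tensor:
        # Flatten spatial CNN features into a token sequence: (B, H*W, D).
        tokens = self.embed(student_feat).flatten(2).transpose(1, 2)
        # Teacher tokens act as queries over the student tokens, pulling the
        # student features into the teacher's representation space.
        projected, _ = self.attn(query=teacher_tokens, key=tokens, value=tokens)
        return self.norm(projected)


def feature_distillation_loss(projected_student: torch.Tensor,
                              teacher_tokens: torch.Tensor) -> torch.Tensor:
    """Align projected student features with (detached) teacher features."""
    return F.mse_loss(projected_student, teacher_tokens.detach())


if __name__ == "__main__":
    # Toy shapes: one CNN stage output and a ViT-style token sequence.
    student_feat = torch.randn(2, 256, 14, 14)   # (B, C, H, W)
    teacher_tokens = torch.randn(2, 196, 384)    # (B, N, D)
    projector = CrossAttentionProjector(student_channels=256, teacher_dim=384)
    loss = feature_distillation_loss(projector(student_feat, teacher_tokens), teacher_tokens)
    print(loss.item())
```

This sketch matches a single, fixed teacher-student layer pair; in the paper, the adaptive distillation router additionally decides which teacher layer supervises which student layer.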