Keywords: Knowledge Distillation, Low-Rank Approximation, Transformer, Representation Learning
TL;DR: Distill Vision Transformers to CNNs via Low-Rank Representation Approximation
Abstract: Vision Transformers attain state-of-the-art performance on diverse vision tasks thanks to their scalability and long-range dependency modeling. Meanwhile, CNNs remain practical and efficient in many industry scenarios, owing to their inductive biases and mature tiny architectures. It is therefore a challenging yet interesting problem to study Knowledge Distillation (KD) between these two different architectures; in particular, how to transfer global information from Vision Transformers to tiny CNNs. We point out that many current CNN distillation methods are ineffective in the Vision Transformer distillation scenario, which implies that distilling global information is not easy due to the architecture gap. We develop an encoder-decoder representation distillation framework, namely \textbf{L}ow \textbf{R}ank \textbf{R}epresentation \textbf{A}pproximation (LRRA), to address this problem. The key insight of LRRA is that global information modeling can be viewed as finding the most important bases and their corresponding codes, a process that can be solved by matrix decomposition. Specifically, the student representation is encoded into a low-rank latent representation that is used to approximate the teacher representation. The most distinguishable knowledge, i.e., global information, is distilled via this low-rank representation approximation. The proposed method admits a closed-form solution without introducing extra learnable parameters or hand-crafted engineering. We benchmark 11 KD methods to demonstrate the usefulness of our approach, and extensive ablation studies validate the necessity of the low-rank structure.
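The abstract's core idea (approximating a representation through a low-rank decomposition with a closed-form solution) can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' actual LRRA framework: it uses truncated SVD (one closed-form matrix decomposition) to build a rank-constrained approximation of a "teacher" feature matrix, and measures the residual as a stand-in for a distillation objective. The function name `low_rank_approx`, the feature shapes, and the MSE residual are all illustrative assumptions.

```python
import numpy as np

def low_rank_approx(X, rank):
    """Closed-form rank-`rank` approximation of X via truncated SVD.

    By the Eckart-Young theorem this is the best rank-r approximation
    of X in the Frobenius norm; no learnable parameters are involved.
    """
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :rank] @ np.diag(S[:rank]) @ Vt[:rank, :]

# Toy "teacher" feature map: e.g., 14x14 = 196 tokens with 64 channels
# (shapes are illustrative, not taken from the paper).
rng = np.random.default_rng(0)
teacher = rng.standard_normal((196, 64))

# Rank-8 latent reconstruction of the teacher representation.
approx = low_rank_approx(teacher, rank=8)

# A simple residual loss could then measure how well the low-rank
# approximation captures the dominant (global) structure.
loss = np.mean((teacher - approx) ** 2)
print(f"rank-8 residual MSE: {loss:.4f}")
```

In a distillation setting, the analogous step would encode the *student* features into such a low-rank latent and penalize the gap to the teacher features; here the toy example only demonstrates the closed-form decomposition itself.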
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Deep Learning and representational learning