Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Yuzhu Wang; Lechao Cheng; Manni Duan; Yongheng Wang; Zunlei Feng; Shu Kong

Improving Knowledge Distillation via Regularizing Feature Direction and Norm

Yuzhu Wang, Lechao Cheng, Manni Duan, Yongheng Wang, Zunlei Feng, Shu Kong

15 Sept 2023 (modified: 25 Mar 2024)ICLR 2024 Conference Withdrawn SubmissionEveryoneRevisionsBibTeX

Keywords: knowledge distillation, feature norm, feature direction, network compression, feature regularization, student training

Abstract: Knowledge distillation (KD) exploits a large well-trained {\tt teacher} neural network to train a small {\tt student} network on the same dataset for the same task. Treating {\tt teacher}'s feature as knowledge, prevailing methods train {\tt student} by aligning its features with the {\tt teacher}'s, e.g., by minimizing the KL-divergence between their logits or L2 distance between their features at intermediate layers. While it is natural to assume that better feature alignment helps distill {\tt teacher}'s knowledge, simply forcing this alignment does not directly contribute to the {\tt student}'s performance, e.g., classification accuracy. For example, minimizing the L2 distance between the penultimate-layer features (used to compute logits for classification) does not necessarily help learn a better {\tt student}-classifier. Therefore, we are motivated to regularize {\tt student} features at the penultimate layer using {\tt teacher} towards training a better {\tt student} classifier. Specifically, we present a rather simple method that uses {\tt teacher}'s class-mean features to align {\tt student} features w.r.t their {\em direction}. Experiments show that this significantly improves KD performance. Moreover, we empirically find that {\tt student} produces features that have notably smaller norms than {\tt teacher}'s, motivating us to regularize {\tt student} to produce large-norm features. Experiments show that doing so also yields better performance. Finally, we present a simple loss as our main technical contribution that regularizes {\tt student} by simultaneously (1) aligning the \emph{direction} of its features with the {\tt teacher} class-mean feature, and (2) encouraging it to produce large-\emph{norm} features. Experiments on standard benchmarks demonstrate that adopting our technique remarkably improves existing KD methods, achieving the state-of-the-art KD performance through the lens of image classification (on ImageNet and CIFAR100 datasets) and object detection (on the COCO dataset).

Supplementary Material: zip

Primary Area: general machine learning (i.e., none of the above)

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 96

Loading