TL;DR: We explore the knowledge distillation process from a functional perspective to gain nuanced insight into how it works.
Abstract: Empirical findings of accuracy correlations between students and teachers in the knowledge distillation framework have served as supporting evidence for knowledge transfer. In this paper, we seek to explain and understand the knowledge transfer derived from knowledge distillation via functional similarity, hypothesising that knowledge distillation produces a student that is functionally similar to its teacher model. While we accept this hypothesis for two of the three architectures across a range of functional-analysis metrics evaluated against four controls, the results show that knowledge transfer, although significant, is less pronounced than expected even under conditions that maximise the opportunity for functional similarity. Furthermore, results from using Uniform and Gaussian noise as teachers suggest that the knowledge-sharing aspect of knowledge distillation is inadequate to explain the accuracy benefits observed from the knowledge distillation training setup itself. Moreover, we show, in the first instance, that knowledge distillation is not a compression mechanism but primarily a data-dependent training regulariser, with only a small capacity to transfer knowledge in the best case.
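For context, the sketch below illustrates the standard knowledge-distillation objective (a softened-logit KL term mixed with the usual cross-entropy loss) alongside a Uniform/Gaussian noise-teacher control of the kind described in the abstract. The temperature, mixing weight, and function names are illustrative assumptions, not the paper's exact experimental configuration.

```python
# Minimal sketch of the standard knowledge-distillation loss
# (softened teacher logits matched via KL divergence, mixed with
# the usual hard-label cross-entropy). Hyperparameters T and alpha
# are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Weighted sum of the soft (teacher-matching) and hard (label) losses."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# A noise-teacher control: replace the real teacher's logits with random ones
# while keeping the rest of the distillation training setup unchanged.
def noise_teacher_logits(batch_size, num_classes, kind="gaussian"):
    if kind == "gaussian":
        return torch.randn(batch_size, num_classes)
    return torch.rand(batch_size, num_classes)  # uniform noise
```

Training the student with `noise_teacher_logits` in place of a real teacher isolates how much of the accuracy benefit comes from the distillation setup itself rather than from transferred knowledge, which is the comparison the noise-teacher controls are meant to probe.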
Style Files: I have used the style files.
Debunking Challenge: This submission is an entry to the debunking challenge.
Submission Number: 71