Abstract: Knowledge distillation (KD) is an established paradigm for transferring privileged knowledge from a cumbersome model to a more lightweight and efficient one. In recent years, logit-based KD methods have been quickly catching up in performance with their feature-based counterparts. However, existing research has pointed out that logit-based methods are still fundamentally limited by two major issues in their training process, namely overconfident teachers and confirmation bias. Inspired by the success of cross-view learning in fields such as semi-supervised learning, in this work we introduce within-view and cross-view regularisations into standard logit-based distillation frameworks to combat these two issues. We also perform confidence-based soft-label selection to improve the quality of the distillation signals from the teacher, which further mitigates the confirmation bias problem. Despite its apparent simplicity, the proposed Consistency-Regularisation-based Logit Distillation (CRLD) significantly boosts student learning, setting new state-of-the-art results on the standard CIFAR-100, Tiny-ImageNet, and ImageNet datasets across a diversity of teacher and student architectures, whilst introducing no extra network parameters. Orthogonal to ongoing logit-based distillation research, our method enjoys excellent generalisation properties and, without bells and whistles, boosts the performance of various existing approaches by considerable margins. Our code and models will be released.
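To make the abstract's recipe concrete, the following is a minimal dependency-free sketch of a CRLD-style loss. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of two augmented "views" (weak and strong), the temperature, and the confidence threshold are all hypothetical choices for exposition. It combines within-view KD terms (teacher and student see the same view) with cross-view terms (teacher's view differs from the student's), and drops any teacher signal whose maximum softmax probability falls below a threshold, i.e. confidence-based soft-label selection.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p, q):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def crld_style_loss(teacher_weak, teacher_strong, student_weak, student_strong,
                    temperature=4.0, conf_threshold=0.75):
    # Hypothetical sketch of the abstract's ingredients:
    #   - within-view KD: teacher(weak) -> student(weak), teacher(strong) -> student(strong)
    #   - cross-view KD:  teacher(weak) -> student(strong), teacher(strong) -> student(weak)
    #   - soft-label selection: skip a teacher signal if its peak probability
    #     is below conf_threshold (mitigates confirmation bias from noisy labels).
    pairs = [
        (teacher_weak, student_weak),      # within-view
        (teacher_strong, student_strong),  # within-view
        (teacher_weak, student_strong),    # cross-view
        (teacher_strong, student_weak),    # cross-view
    ]
    loss, kept = 0.0, 0
    for t_logits, s_logits in pairs:
        p_t = softmax(t_logits, temperature)
        if max(p_t) < conf_threshold:      # reject low-confidence soft labels
            continue
        p_s = softmax(s_logits, temperature)
        loss += kl_div(p_t, p_s)
        kept += 1
    return loss / max(kept, 1)
```

In practice each term would be computed batch-wise on tensors and weighted against a cross-entropy loss on ground-truth labels; the sketch keeps scalar lists only to show the selection and view-pairing logic.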
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Systems] Systems and Middleware
Relevance To Conference: This work proposes a novel knowledge distillation method that transfers the knowledge of a more capable deep neural network to a smaller, more lightweight, and deployment-friendly one. The proposed method consistently boosts the performance of recently proposed knowledge distillation algorithms and achieves new state-of-the-art results across different datasets and a diversity of teacher-student architecture combinations. Our method is highly relevant to model compression and algorithm deployment, especially on resource-constrained devices. It potentially benefits a wide range of multimedia processing tasks, such as visual and audio understanding, as well as the real-world deployment of such algorithms.
Supplementary Material: zip
Submission Number: 3011