Abstract: Multi-teacher knowledge distillation (MKD) aims to leverage the valuable and diverse knowledge provided by multiple teacher networks to improve the performance of the student network. Existing approaches typically rely on simple methods, such as averaging the prediction logits, or on sub-optimal weighting strategies to combine knowledge from multiple teachers. However, these techniques cannot fully reflect the relative importance of the teachers and may even mislead the student's learning. To address these issues, we propose a novel Decoupled Multi-teacher Knowledge Distillation based on Entropy (DE-MKD). DE-MKD decomposes the vanilla KD loss and assigns a weight to each teacher that reflects its importance, based on the entropy of its predictions. Furthermore, we extend the proposed approach to distill the intermediate features from the teachers to further improve the performance of the student network. Extensive experiments conducted on the publicly available CIFAR-100 image classification dataset demonstrate the effectiveness and flexibility of our proposed approach.
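The abstract does not spell out the weighting rule, so the following is only a minimal NumPy sketch of the general idea of entropy-based teacher weighting: each teacher's softened prediction is scored by its entropy, and (as an assumed scheme) lower-entropy, i.e. more confident, teachers receive larger weights in the combined distillation target. The function names, the temperature value, and the `exp(-entropy)` weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits, temperature=4.0):
    """Temperature-softened softmax over a 1-D logit vector."""
    z = np.asarray(logits, dtype=float) / temperature
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def entropy(p):
    """Shannon entropy of a probability vector (nats)."""
    p = np.clip(p, 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def entropy_weighted_target(teacher_logits, temperature=4.0):
    """Combine several teachers' predictions into one soft target.

    Assumed scheme: weight each teacher by exp(-entropy), so more
    confident teachers contribute more; weights are normalized to sum to 1.
    """
    probs = [softmax(l, temperature) for l in teacher_logits]
    ents = np.array([entropy(p) for p in probs])
    weights = np.exp(-ents)
    weights = weights / weights.sum()
    target = sum(w * p for w, p in zip(weights, probs))
    return weights, target

# A confident teacher (peaked logits) vs. an uncertain one (flat logits):
weights, target = entropy_weighted_target([[10.0, 0.0, 0.0],
                                           [1.0, 1.0, 1.0]])
```

In this sketch the first teacher's peaked logits yield lower entropy, so it receives the larger weight, and the combined target remains a valid probability distribution because the weights sum to one.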