Abstract: Online Knowledge Distillation (KD) is an emerging paradigm capable of generating posterior knowledge without a pre-trained teacher. Online KD concurrently trains auxiliary teachers and a student, wherein the ensemble of teachers collaboratively guides the student’s learning trajectory. While diversity is crucial in ensemble learning, achieving it is challenging, particularly in multi-head structures where teachers share parts of their parameters. We propose a novel online KD framework designed to explicitly cultivate diverse teachers by exposing them to heterogeneous label distributions. This might seem infeasible because all teachers and the student in online KD use the same mini-batch for efficiency and knowledge transfer. Our key idea is the adoption of importance sampling, which enables teachers to experience diverse perspectives by controlling their exposure to the data based on labels. To merge the knowledge of these teachers exposed to different label distributions, we employ a post-compensating Softmax that adjusts the posteriors to compensate for the distorted prior. Extensive experimental analysis demonstrates the effectiveness of our approach in improving the student’s performance and enhancing its feature representations for downstream computer vision tasks.
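The abstract names two mechanisms: label-based importance sampling over a shared mini-batch, and a prior-correcting ("post-compensating") Softmax. The sketch below illustrates one plausible reading of both, assuming importance sampling is realized as per-sample loss reweighting by a label-ratio and the Softmax compensation follows the standard logit-adjustment form of prior correction; all function names, shapes, and distributions are hypothetical and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def label_importance_weights(labels, teacher_label_dist, batch_label_dist):
    """Per-sample importance weights for one teacher (hypothetical helper).

    labels:             (B,) integer class labels of the shared mini-batch
    teacher_label_dist: (C,) label distribution this teacher should be exposed to
    batch_label_dist:   (C,) label distribution of the shared mini-batch
    """
    # w(y) = q_teacher(y) / q_batch(y): samples are over/under-weighted per teacher,
    # simulating a heterogeneous label distribution without changing the mini-batch.
    ratio = teacher_label_dist / batch_label_dist.clamp_min(1e-12)
    return ratio[labels]

def prior_compensated_softmax(logits, train_prior, target_prior):
    """Standard prior correction via logit adjustment (assumed form).

    logits:       (B, C) teacher logits
    train_prior:  (C,) label prior the teacher was effectively trained under
    target_prior: (C,) prior to compensate back to (e.g. the batch/uniform prior)
    """
    # p(y|x) ∝ softmax(z + log target_prior - log train_prior)
    adjusted = logits + target_prior.clamp_min(1e-12).log() - train_prior.clamp_min(1e-12).log()
    return F.softmax(adjusted, dim=-1)

# Toy usage on a 3-class problem.
B, C = 4, 3
labels = torch.tensor([0, 1, 2, 0])
logits = torch.randn(B, C)

batch_prior   = torch.full((C,), 1.0 / C)       # shared mini-batch, roughly uniform
teacher_prior = torch.tensor([0.6, 0.3, 0.1])   # skewed exposure for this teacher

w = label_importance_weights(labels, teacher_prior, batch_prior)       # (B,)
per_sample_ce = F.cross_entropy(logits, labels, reduction="none")      # (B,)
teacher_loss = (w * per_sample_ce).mean()                              # importance-weighted loss

# Before merging teachers' posteriors, undo the distortion induced by the skewed prior.
compensated = prior_compensated_softmax(logits, teacher_prior, batch_prior)  # (B, C)
```

Under this reading, every teacher sees the same images but weights them differently by label, and the compensation step maps each teacher's posteriors back to a common prior before they are ensembled for the student.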