Online Knowledge Distillation with History-Aware Teachers

Published: 18 Jul 2022, Last Modified: 23 Jul 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: In this work, we propose a novel online knowledge distillation (OKD) approach built upon the classical deep mutual learning framework, in which peer networks (students) treat each other as teachers by learning from one another's predictions. The proposed method traces and leverages two levels of information encoded in each peer's learning trajectory to dynamically construct superior teachers that supervise the other students. We first build a recurrent neural network associated with each peer, which takes both the network's current and previous logits as input and outputs integrated logits of the same dimension that serve as the transferred knowledge. In this way, the teachers provide an enhanced representation of knowledge. Beyond that, we also build a weight-averaged surrogate for each network, which maintains an exponential moving average of its learned parameters during the online training procedure. The proposed approach exploits the hidden information in the online learning process instead of myopically learning from peers' outputs at a single time/iteration step. By stabilizing the transferred knowledge, it potentially reduces the uncertainty of peer predictions from which previous OKD studies suffer. We evaluate the proposed approach on benchmark image classification datasets and network architectures. Experimental results demonstrate its effectiveness, with clear performance improvements over state-of-the-art methods.
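The abstract describes two history-aware components: an RNN that fuses a peer's current and past logits into integrated teacher logits, and a weight-averaged (EMA) surrogate of each peer's parameters. The following is a minimal PyTorch sketch of how such components could look; it is not the authors' implementation, and all class names, dimensions, and hyperparameters (e.g. `LogitIntegrator`, `hidden_dim`, `decay`, `temperature`) are illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the two history-aware
# components described in the abstract, using standard PyTorch primitives.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class LogitIntegrator(nn.Module):
    """Hypothetical RNN that fuses a peer's current and previous logits into
    integrated teacher logits with the same dimension (num_classes)."""

    def __init__(self, num_classes: int, hidden_dim: int = 128):
        super().__init__()
        self.rnn = nn.GRU(num_classes, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, logit_history: torch.Tensor) -> torch.Tensor:
        # logit_history: (batch, time_steps, num_classes), ordered oldest to newest.
        _, h_n = self.rnn(logit_history)   # h_n: (1, batch, hidden_dim)
        return self.head(h_n.squeeze(0))   # integrated logits: (batch, num_classes)


class EMASurrogate:
    """Weight-averaged surrogate: exponential moving average of a peer's parameters."""

    def __init__(self, model: nn.Module, decay: float = 0.999):
        self.ema_model = copy.deepcopy(model).eval()
        for p in self.ema_model.parameters():
            p.requires_grad_(False)
        self.decay = decay

    @torch.no_grad()
    def update(self, model: nn.Module) -> None:
        # ema_p <- decay * ema_p + (1 - decay) * p, applied after each training step.
        for ema_p, p in zip(self.ema_model.parameters(), model.parameters()):
            ema_p.mul_(self.decay).add_(p, alpha=1.0 - self.decay)


def soft_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 3.0) -> torch.Tensor:
    """KL-divergence soft-label loss commonly used in mutual-learning style OKD."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=1),
        F.softmax(teacher_logits.detach() / t, dim=1),
        reduction="batchmean",
    ) * (t * t)
```

In a mutual-learning loop, one would presumably append each peer's logits to a short history buffer every iteration, feed that buffer to the peer's `LogitIntegrator` to obtain teacher logits for the other students, and call `EMASurrogate.update` after every optimizer step; the exact schedule and loss weighting are choices the paper itself would specify.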