Generalization Self-distillation with Epoch-wise Regularization

Yuelong Xia, Yun Yang

Published: 2021, Last Modified: 15 May 2025, Venue: IJCNN 2021, License: CC BY-SA 4.0

Abstract: Recent advances in deep neural networks have achieved remarkable success in various computer vision tasks. However, deep neural networks with millions of parameters may generalize poorly due to overfitting. Many methods have been proposed to improve generalization performance, such as data augmentation, label smoothing, and knowledge distillation. In this paper, we extend self-knowledge distillation to enhance model generalization without incurring extra computation cost. The resulting method, called EWR-KD, is a simple yet effective way to progressively distill knowledge from the model itself. Concretely, it consists of two components: 1) a self-distillation scheme that progressively softens the learning targets by using the past model prediction; 2) a sample-reweighting scheme that dynamically decides the trust degree, using uncertainty estimation, so that more informative knowledge is transferred. With these two components, EWR-KD is robust to both corrupted and adversarial noise, and can be easily combined with current advanced regularization techniques. We theoretically show that EWR-KD minimizes cross-entropy with an added epoch-wise regularization term, which measures the difference between the past model prediction at the ($t-1$)-th epoch and the current prediction at the $t$-th epoch. Finally, extensive experimental results on clean and noisy datasets empirically demonstrate that EWR-KD not only improves the performance of state-of-the-art baselines but also yields well-calibrated predictions.
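The two components described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulas, so the convex mixing rule, the entropy-based trust weight, and all names (`trust_from_entropy`, `soft_targets`, `self_distill_loss`, `alpha_max`) are assumptions made for illustration. Under this assumed mixing rule, the loss splits into $(1-\alpha)\,\mathrm{CE}(y, p_t) + \alpha\,\mathrm{CE}(p_{t-1}, p_t)$, which is one way a cross-entropy term plus an epoch-wise term comparing predictions at epochs $t-1$ and $t$ could arise.

```python
import numpy as np

def softmax(logits):
    # Numerically stable row-wise softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def trust_from_entropy(past_probs, alpha_max=0.5, eps=1e-12):
    # ASSUMPTION: uncertainty is the normalized entropy of the epoch (t-1)
    # prediction; confident past predictions receive more trust.
    num_classes = past_probs.shape[1]
    entropy = -(past_probs * np.log(past_probs + eps)).sum(axis=1)
    return alpha_max * (1.0 - entropy / np.log(num_classes))

def soft_targets(one_hot, past_probs, trust):
    # ASSUMPTION: per-sample convex mix of the hard label and the
    # past-epoch prediction, weighted by the trust degree.
    a = trust[:, None]
    return (1.0 - a) * one_hot + a * past_probs

def self_distill_loss(logits, one_hot, past_probs, alpha_max=0.5, eps=1e-12):
    # Cross-entropy of the current (epoch t) prediction against soft targets.
    probs = softmax(logits)
    trust = trust_from_entropy(past_probs, alpha_max)
    targets = soft_targets(one_hot, past_probs, trust)
    return float(-(targets * np.log(probs + eps)).sum(axis=1).mean())
```

In a training loop following this sketch, `past_probs` would be the softmax outputs stored for each sample at the end of the previous epoch, so no second network or extra forward pass is required. When the past prediction is uniform, the assumed trust weight is zero and the loss reduces to ordinary cross-entropy.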