- TL;DR: Inspired by trial-to-trial variability in the brain that can result from multiple noise sources, we introduce variability through noise in the knowledge distillation framework and studied their effect on generalization and robustness.
- Abstract: Knowledge distillation is an effective model compression technique in which a smaller model is trained to mimic a larger pretrained model. However in order to make these compact models suitable for real world deployment, not only do we need to reduce the performance gap but also we need to make them more robust to commonly occurring and adversarial perturbations. Noise permeates every level of the nervous system, from the perception of sensory signals to the generation of motor responses. We therefore believe that noise could be a crucial element in improving neural networks training and addressing the apparently contradictory goals of improving both the generalization and robustness of the model. Inspired by trial-to-trial variability in the brain that can result from multiple noise sources, we introduce variability through noise at either the input level or the supervision signals. Our results show that noise can improve both the generalization and robustness of the model. ”Fickle Teacher” which uses dropout in teacher model as a source of response variation leads to significant generalization improvement. ”Soft Randomization”, which matches the output distribution of the student model on the image with Gaussian noise to the output of the teacher on original image, improves the adversarial robustness manifolds compared to the student model trained with Gaussian noise. We further show the surprising effect of random label corruption on a model’s adversarial robustness. The study highlights the benefits of adding constructive noise in the knowledge distillation framework and hopes to inspire further work in the area.
- Keywords: Knowledge distillation, noise, generalization, adversarial robustness, natural robustness, robustness