Abstract: Knowledge distillation efficiently improves a small model's performance by mimicking the teacher model's behavior. Most existing methods assume that distilling from a larger and more accurate teacher yields a better student. However, several studies have reported the difficulty of distilling from large teacher models and resort to heuristics to address it. In this work, we demonstrate that large teacher models can still be effective in knowledge distillation. We show that the spurious features learned by large models are the cause of this difficulty for small students. To overcome this issue, we propose employing ℓ1 regularization to prevent teacher models from learning an excessive number of spurious features. Our method alleviates the poor learning of small students when there is a significant size disparity between teachers and students. We achieve substantial improvements on various architectures, e.g., ResNet, WideResNet, and VGG, on diverse datasets, including CIFAR-100, Tiny-ImageNet, and ImageNet.
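As a rough illustration of the idea (not necessarily the paper's exact formulation), the sketch below adds an ℓ1 penalty to a teacher's training objective. The choice of penalizing the penultimate-layer features and the coefficient `l1_coeff` are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def teacher_loss(logits, features, targets, l1_coeff=1e-4):
    # Standard classification loss for training the teacher.
    ce = F.cross_entropy(logits, targets)
    # Assumed placement of the l1 penalty: penultimate-layer features,
    # discouraging the teacher from spreading mass over many
    # (potentially spurious) feature dimensions.
    l1_penalty = features.abs().sum(dim=1).mean()
    return ce + l1_coeff * l1_penalty

# Toy usage with random tensors standing in for a teacher forward pass.
logits = torch.randn(8, 100)            # batch of 8, 100 classes
features = torch.randn(8, 512)          # hypothetical penultimate features
targets = torch.randint(0, 100, (8,))
loss = teacher_loss(logits, features, targets)
print(loss.item())
```

The regularized teacher is then used as usual in a standard distillation pipeline; only the teacher's training objective changes.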