Abstract: Feature distillation is a widely used training method for transferring feature information from a teacher to a student network. Current methods seek to minimize the reconstruction error of hidden feature maps between the teacher and student models by explicitly optimizing a distillation loss. However, some feature-loss methods require complex transformations that are not easy to optimize. In this paper, we propose a novel and effective feature distillation method that learns to transfer knowledge by applying feature fusion as an alternative to a distillation loss. Specifically, we fuse the intermediate features of the student model into an attention teacher network, which has better representation ability and relatively low training cost. During training, this separable feature fusion effectively transfers feature knowledge and is easy to optimize without complex transformations. After training, the feature fusion module and the teacher network can be discarded, and the student network can be used on its own for inference. Equipped with an auxiliary classifier for ensemble logits distillation, our Separable Feature Knowledge Distillation (SFKD) obtains state-of-the-art performance. In experiments, SFKD achieves a 4% performance improvement on CIFAR-100 and 2% on ImageNet for ResNet models, substantially outperforming other feature distillation methods.
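The abstract does not give implementation details, so the PyTorch sketch below only illustrates the general idea under assumptions: the `ToyNet` split, the 1x1 `fuse` projection, and the loss weighting are hypothetical placeholders, not the authors' attention teacher, fusion module, or auxiliary classifier. It shows how a student's intermediate feature can be fed into a frozen teacher's remaining layers so that gradients from the fused path shape the student's features, with the fused logits additionally distilled back into the student head.

```python
# Minimal sketch of separable feature fusion, assuming a PyTorch setup
# with toy CNNs. Details (fusion module, stage choice, loss weights,
# auxiliary classifier) are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyNet(nn.Module):
    """Small CNN split into a 'front' and a 'tail' so an intermediate
    feature can be extracted (student) or injected (teacher)."""
    def __init__(self, width, num_classes=100):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.tail = nn.Sequential(
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        feat = self.front(x)
        return self.tail(feat), feat


student = ToyNet(width=32)              # kept for inference
teacher = ToyNet(width=64)              # assumed pretrained; discarded after training
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Hypothetical fusion module: project student features to the teacher's
# channel width so they can be processed by the teacher's tail.
fuse = nn.Conv2d(32, 64, kernel_size=1)

opt = torch.optim.SGD(
    list(student.parameters()) + list(fuse.parameters()), lr=0.05, momentum=0.9
)


def train_step(x, y, T=4.0, alpha=1.0):
    s_logits, s_feat = student(x)
    # Fuse the student's intermediate feature into the frozen teacher:
    # gradients flow back through the teacher's tail into `fuse` and the
    # student's front, transferring feature knowledge without an explicit
    # feature-reconstruction loss.
    fused_logits = teacher.tail(fuse(s_feat))
    ce_student = F.cross_entropy(s_logits, y)
    ce_fused = F.cross_entropy(fused_logits, y)
    # Distill the teacher-guided fused logits back into the student head.
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(fused_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * T * T
    loss = ce_student + ce_fused + alpha * kd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

After training, `teacher` and `fuse` are simply dropped; inference uses only `student(x)[0]`, which is the separability property described in the abstract.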