Abstract: Feature distillation is a widely used training method for transferring feature information from a teacher to a student network. Current methods seek to minimize the reconstruction error of hidden feature maps between the teacher and student models by explicitly optimizing a distillation loss. However, some feature-loss methods require complex transformations that are not easy to optimize. In this paper, we propose a novel and effective feature distillation method that learns to transfer knowledge by applying feature fusion as an alternative to a distillation loss. Specifically, we fuse the intermediate features of the student model into an attention teacher network, which has better representation ability and relatively low training cost. During training, this separable feature fusion effectively transfers feature knowledge and is easy to optimize without complex transformations. After training, the feature fusion module and the teacher network can be discarded, and the student network can be used on its own for inference. Equipped with an auxiliary classifier for ensemble logits distillation, our Separable Feature Knowledge Distillation (SFKD) obtains state-of-the-art performance. In experiments, SFKD achieves a 4% performance improvement on CIFAR-100 and 2% on ImageNet for ResNet models, substantially outperforming other feature distillation methods.
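The abstract does not give implementation details, so the PyTorch sketch below only illustrates the general idea under assumptions: the `ToyNet` split, the 1x1 `fuse` projection, and the loss weighting are hypothetical placeholders, not the authors' attention teacher, fusion module, or auxiliary classifier. It shows how a student's intermediate feature can be fed into a frozen teacher's remaining layers so that gradients from the fused path shape the student's features, with the fused logits additionally distilled back into the student head.

```python
# Minimal sketch of separable feature fusion, assuming a PyTorch setup
# with toy CNNs. Details (fusion module, stage choice, loss weights,
# auxiliary classifier) are illustrative guesses.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyNet(nn.Module):
    """Small CNN split into a 'front' and a 'tail' so an intermediate
    feature can be extracted (student) or injected (teacher)."""
    def __init__(self, width, num_classes=100):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.tail = nn.Sequential(
            nn.Conv2d(width, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(width, num_classes),
        )

    def forward(self, x):
        feat = self.front(x)
        return self.tail(feat), feat


student = ToyNet(width=32)              # kept for inference
teacher = ToyNet(width=64)              # assumed pretrained; discarded after training
teacher.eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Hypothetical fusion module: project student features to the teacher's
# channel width so they can be processed by the teacher's tail.
fuse = nn.Conv2d(32, 64, kernel_size=1)

opt = torch.optim.SGD(
    list(student.parameters()) + list(fuse.parameters()), lr=0.05, momentum=0.9
)


def train_step(x, y, T=4.0, alpha=1.0):
    s_logits, s_feat = student(x)
    # Fuse the student's intermediate feature into the frozen teacher:
    # gradients flow back through the teacher's tail into `fuse` and the
    # student's front, transferring feature knowledge without an explicit
    # feature-reconstruction loss.
    fused_logits = teacher.tail(fuse(s_feat))
    ce_student = F.cross_entropy(s_logits, y)
    ce_fused = F.cross_entropy(fused_logits, y)
    # Distill the teacher-guided fused logits back into the student head.
    kd = F.kl_div(
        F.log_softmax(s_logits / T, dim=1),
        F.softmax(fused_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * T * T
    loss = ce_student + ce_fused + alpha * kd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

After training, `teacher` and `fuse` are simply dropped; inference uses only `student(x)[0]`, which is the separability property described in the abstract.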