Abstract: Traditional Knowledge Distillation (KD) transfers all outputs from a teacher model to a student model, often introducing knowledge redundancy. This redundancy dilutes critical information, leading to degraded student model performance. To address this, we propose Salient Feature Masking for Knowledge Distillation (SFKD), a lightweight enhancement that masks out less informative components and selectively distills only the top-K activations. SFKD is a drop-in modification applicable to both logit-based and feature-based KD, incurs negligible overhead, and sharpens the student’s learning signal. Empirically, SFKD yields consistent gains over strong KD baselines across architectures (ConvNeXt, ViT) and datasets (CIFAR-100: up to +2.43 pp; CUB-200: up to +6.39 pp; ImageNet-1K: up to +3.57 pp). We also provide intuition from the information bottleneck perspective to motivate why filtering out less salient teacher signals benefits the student. Overall, SFKD is a simple, empirically validated method for training student models that are both leaner and more accurate.
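To make the core idea concrete, here is a minimal sketch of top-K selective distillation on logits, written in plain Python. The function name, the choice of KL divergence over the masked support, and the renormalization step are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs, temperature=1.0):
    # Numerically stable softmax with a distillation temperature.
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mask_distill_loss(teacher_logits, student_logits, k=3, temperature=2.0):
    """Illustrative SFKD-style loss (assumed form, not the official one):
    keep only the teacher's K largest logits, renormalize both
    distributions over those positions, and match them with KL divergence.
    """
    # Indices of the teacher's top-K activations (the "salient" components).
    topk = sorted(range(len(teacher_logits)),
                  key=lambda i: teacher_logits[i], reverse=True)[:k]
    # Restrict both distributions to the selected support and renormalize.
    t = softmax([teacher_logits[i] for i in topk], temperature)
    s = softmax([student_logits[i] for i in topk], temperature)
    # KL(teacher || student) on the masked support; less salient
    # teacher outputs contribute nothing to the gradient.
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
```

In this sketch the loss is zero when the student reproduces the teacher's top-K logits exactly, and it ignores disagreements on the masked-out positions, which is the sense in which the learning signal is "sharpened."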