Efficient Knowledge Distillation via Salient Feature Masking

TMLR Paper 5285 Authors

03 Jul 2025 (modified: 21 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Traditional Knowledge Distillation (KD) transfers all outputs from a teacher model to a student model, often introducing knowledge redundancy. This redundancy dilutes critical information, leading to degraded student model performance. To address this, we propose Salient Feature Masking for Knowledge Distillation (SFKD), where only the most informative features are selectively distilled, enhancing student performance. Our approach is grounded in the Information Bottleneck (IB) principle, where focusing on features with higher mutual information with the input leads to more effective distillation. SFKD integrates with existing KD variants and enhances the transfer of "dark knowledge". It consistently improves image classification accuracy across diverse models, including ConvNeXt and ViT, achieving gains of 5.44% on CIFAR-100 and 3.57% on ImageNet-1K. When combined with current KD methods, SFKD outperforms state-of-the-art results by 1.47%.
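To illustrate the general idea of distilling only the most informative teacher outputs, the sketch below shows a minimal PyTorch distillation loss restricted to a subset of "salient" positions. It is a hypothetical illustration only: the top-k selection by teacher logit magnitude and the temperature `T` are assumptions for the example, not the paper's actual SFKD masking criterion, which is based on mutual information with the input.

```python
import torch
import torch.nn.functional as F

def salient_masked_kd_loss(student_logits, teacher_logits, k=10, T=4.0):
    """Hypothetical sketch: distill only the k most salient teacher classes.

    Saliency is approximated here by teacher logit magnitude; the paper's
    SFKD criterion (mutual information with the input) is not reproduced.
    """
    # Pick the k highest-magnitude teacher logits per sample as "salient".
    _, salient_idx = teacher_logits.abs().topk(k, dim=1)

    # Gather the salient positions from both teacher and student.
    t_sel = teacher_logits.gather(1, salient_idx)
    s_sel = student_logits.gather(1, salient_idx)

    # Temperature-scaled KL divergence restricted to the salient positions.
    p_teacher = F.softmax(t_sel / T, dim=1)
    log_p_student = F.log_softmax(s_sel / T, dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T

# Example usage with random logits for a 100-class problem (illustrative only):
student_logits = torch.randn(32, 100)
teacher_logits = torch.randn(32, 100)
loss = salient_masked_kd_loss(student_logits, teacher_logits, k=10)
```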
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Martha_White1
Submission Number: 5285