Multi-Teacher Knowledge Distillation with Clustering-Based Sentence Pruning for Efficient Student Models
Abstract: Transformer-based encoder models such as BERT and RoBERTa perform well on NLP tasks but are computationally intensive to deploy. We propose Clustering-Based Knowledge Distillation with Sentence Pruning, a novel framework that combines multi-teacher distillation with structure-aware sentence selection to improve student model efficiency. Our method integrates teacher outputs via validation-aware ensembling and prunes redundant sentences using semantic similarity and TF-IDF-based scoring. Experiments on GLUE, AG News, and PubMed RCT show that our method consistently enhances student model performance, achieving 95.4% accuracy on SST-2, the highest accuracy on AG News (91.14%) and PubMed RCT (78.00%), and improved accuracy on RTE through sentence pruning. Ablation studies confirm the effectiveness of jointly applying clustering and pruning. Our framework offers a practical and scalable solution for deploying compact models in resource-limited environments.
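To make the two core ideas in the abstract concrete, the sketch below illustrates (1) combining teacher logits with weights derived from each teacher's validation accuracy and (2) pruning redundant sentences with TF-IDF-based salience scores and pairwise similarity. This is a minimal illustration, not the authors' released implementation; all function names, thresholds, and the use of TF-IDF cosine similarity as the semantic-similarity measure are assumptions for exposition.

```python
# Minimal sketch (assumed, not the paper's code) of validation-aware teacher
# ensembling and TF-IDF / similarity-based sentence pruning.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ensemble_teacher_logits(teacher_logits, val_accuracies):
    """Average per-teacher logits, weighting each teacher by its validation accuracy."""
    weights = np.asarray(val_accuracies, dtype=float)
    weights = weights / weights.sum()                 # normalize weights to sum to 1
    stacked = np.stack(teacher_logits, axis=0)        # (n_teachers, batch, n_classes)
    return np.tensordot(weights, stacked, axes=1)     # weighted average -> (batch, n_classes)


def prune_sentences(sentences, keep_ratio=0.7, redundancy_threshold=0.9):
    """Keep the most salient sentences (by TF-IDF mass), dropping near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (n_sentences, vocab)
    salience = np.asarray(tfidf.sum(axis=1)).ravel()     # TF-IDF-based importance score
    sims = cosine_similarity(tfidf)                      # pairwise similarity proxy

    order = np.argsort(-salience)                        # most salient first
    kept = []
    for i in order:
        # Skip sentences nearly identical to one already kept (redundancy pruning).
        if any(sims[i, j] >= redundancy_threshold for j in kept):
            continue
        kept.append(i)
        if len(kept) >= max(1, int(keep_ratio * len(sentences))):
            break
    return [sentences[i] for i in sorted(kept)]          # restore original order
```

In a distillation loop, the pruned sentences would form the student's input and the weighted-average logits would serve as the soft targets; the keep ratio and redundancy threshold shown here are placeholder values.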
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization, pruning, distillation, parameter-efficient-training, data-efficient training, efficient inference, NLP in resource-constrained settings
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: quantization, pruning, distillation, parameter-efficient-training, data-efficient training, efficient inference, NLP in resource-constrained settings
Submission Number: 6096