Multi-Teacher Knowledge Distillation with Clustering-Based Sentence Pruning for Efficient Student Models

ACL ARR 2025 May Submission 6096 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Transformer-based encoder models such as BERT and RoBERTa perform well on NLP tasks but are computationally intensive for deployment. We propose Clustering-Based Knowledge Distillation with Sentence Pruning, a novel framework that combines multi-teacher distillation with structure-aware sentence selection to improve student model efficiency. Our method integrates teacher outputs via validation-aware ensembling and prunes redundant sentences using semantic similarity and TF-IDF-based scoring. Experiments across GLUE, AG News, and PubMed RCT demonstrate that our method consistently enhances student model performance, achieving 95.4% accuracy on SST-2, the highest accuracy on AG News (91.14%) and PubMed RCT (78.00%), and improved accuracy on RTE through sentence pruning. Ablation studies confirm the effectiveness of jointly applying clustering and pruning. Our framework offers a practical and scalable solution for deploying compact models in resource-limited environments.
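The abstract describes two mechanisms: validation-aware ensembling of multiple teachers and redundancy-based sentence pruning driven by TF-IDF scores and semantic similarity. The sketch below is a minimal, hypothetical illustration of how such components could be wired up; the function names, thresholds, and the use of TF-IDF cosine similarity as a stand-in for the paper's semantic similarity measure are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two components described in the abstract:
# (1) validation-aware ensembling of teacher logits into soft targets, and
# (2) sentence pruning via TF-IDF salience plus pairwise similarity.
# Function names and thresholds are illustrative, not from the paper.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ensemble_teacher_logits(teacher_logits, val_accuracies, temperature=2.0):
    """Combine per-teacher logits into soft targets, weighting each teacher
    by its validation accuracy (normalized to sum to 1)."""
    weights = np.asarray(val_accuracies, dtype=float)
    weights = weights / weights.sum()
    stacked = np.stack(teacher_logits, axis=0)        # (teachers, examples, classes)
    mixed = np.tensordot(weights, stacked, axes=1)    # (examples, classes)
    # Temperature-scaled softmax produces the soft labels for distillation.
    scaled = mixed / temperature
    scaled -= scaled.max(axis=1, keepdims=True)
    probs = np.exp(scaled)
    return probs / probs.sum(axis=1, keepdims=True)


def prune_sentences(sentences, keep_ratio=0.7, sim_threshold=0.8):
    """Drop low-salience and near-duplicate sentences from a document.
    Salience is the mean TF-IDF weight of a sentence; redundancy is measured
    with cosine similarity between TF-IDF vectors (used here as a stand-in
    for a learned semantic similarity)."""
    if len(sentences) <= 1:
        return sentences
    tfidf = TfidfVectorizer().fit_transform(sentences)
    salience = np.asarray(tfidf.mean(axis=1)).ravel()
    sims = cosine_similarity(tfidf)

    keep = []
    budget = max(1, int(round(keep_ratio * len(sentences))))
    for idx in np.argsort(-salience):                 # most salient first
        if len(keep) >= budget:
            break
        # Skip sentences too similar to one already kept (redundant).
        if all(sims[idx, j] < sim_threshold for j in keep):
            keep.append(idx)
    keep.sort()                                       # restore original order
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    logits_a = np.array([[2.0, 0.1], [0.2, 1.5]])
    logits_b = np.array([[1.2, 0.3], [0.1, 2.0]])
    print(ensemble_teacher_logits([logits_a, logits_b], val_accuracies=[0.91, 0.88]))

    doc = [
        "The model distills knowledge from several teachers.",
        "Knowledge is distilled from several teacher models.",  # near-duplicate
        "Sentence pruning removes redundant content before training.",
    ]
    print(prune_sentences(doc, keep_ratio=0.7))
```

In this sketch the student would then be trained against the ensembled soft targets on the pruned inputs; how the clustering step groups sentences or teachers is specified in the paper itself and is not reproduced here.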
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization, pruning, distillation, parameter-efficient-training, data-efficient training, efficient inference, NLP in resource-constrained settings
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6096