Multi-Teacher Knowledge Distillation with Clustering-Based Sentence Pruning for Efficient Student Models
Abstract: Transformer-based encoder models such as BERT and RoBERTa perform well on NLP tasks but are computationally intensive to deploy. We propose Clustering-Based Knowledge Distillation with Sentence Pruning, a novel framework that combines multi-teacher distillation with structure-aware sentence selection to improve student model efficiency. Our method integrates teacher outputs via validation-aware ensembling and prunes redundant sentences using semantic similarity and TF-IDF-based scoring. Experiments on GLUE, AG News, and PubMed RCT show that our method consistently enhances student model performance, achieving 95.4% accuracy on SST-2, the highest accuracy on AG News (91.14%) and PubMed RCT (78.00%), and improved accuracy on RTE through sentence pruning. Ablation studies confirm the effectiveness of jointly applying clustering and pruning. Our framework offers a practical and scalable solution for deploying compact models in resource-limited environments.
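To make the two core ideas in the abstract concrete, the sketch below illustrates (1) combining teacher logits with weights derived from each teacher's validation accuracy and (2) pruning redundant sentences with TF-IDF-based salience scores and pairwise similarity. This is a minimal illustration, not the authors' released implementation; all function names, thresholds, and the use of TF-IDF cosine similarity as the semantic-similarity measure are assumptions for exposition.

```python
# Minimal sketch (assumed, not the paper's code) of validation-aware teacher
# ensembling and TF-IDF / similarity-based sentence pruning.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ensemble_teacher_logits(teacher_logits, val_accuracies):
    """Average per-teacher logits, weighting each teacher by its validation accuracy."""
    weights = np.asarray(val_accuracies, dtype=float)
    weights = weights / weights.sum()                 # normalize weights to sum to 1
    stacked = np.stack(teacher_logits, axis=0)        # (n_teachers, batch, n_classes)
    return np.tensordot(weights, stacked, axes=1)     # weighted average -> (batch, n_classes)


def prune_sentences(sentences, keep_ratio=0.7, redundancy_threshold=0.9):
    """Keep the most salient sentences (by TF-IDF mass), dropping near-duplicates."""
    tfidf = TfidfVectorizer().fit_transform(sentences)   # (n_sentences, vocab)
    salience = np.asarray(tfidf.sum(axis=1)).ravel()     # TF-IDF-based importance score
    sims = cosine_similarity(tfidf)                      # pairwise similarity proxy

    order = np.argsort(-salience)                        # most salient first
    kept = []
    for i in order:
        # Skip sentences nearly identical to one already kept (redundancy pruning).
        if any(sims[i, j] >= redundancy_threshold for j in kept):
            continue
        kept.append(i)
        if len(kept) >= max(1, int(keep_ratio * len(sentences))):
            break
    return [sentences[i] for i in sorted(kept)]          # restore original order
```

In a distillation loop, the pruned sentences would form the student's input and the weighted-average logits would serve as the soft targets; the keep ratio and redundancy threshold shown here are placeholder values.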
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: quantization, pruning, distillation, parameter-efficient-training, data-efficient training, efficient inference, NLP in resource-constrained settings
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: English
Keywords: quantization, pruning, distillation, parameter-efficient-training, data-efficient training, efficient inference, NLP in resource-constrained settings
Submission Number: 6096