Abstract: High-quality models for natural language processing tasks such as summarization and chatbots often rely on large architectures, making them computationally intensive and challenging to deploy in resource-constrained environments. While knowledge distillation enables smaller student models to approximate the performance of larger teacher models, existing methods frequently face significant trade-offs between accuracy and efficiency. In addition, uncertain predictions from the teacher model can negatively impact the student's learning process. In this paper, we introduce CAKD, a novel approach that optimizes the training of student models by using confidence scores to selectively emphasize the teacher model's most reliable predictions. By integrating entropy-based confidence weighting into the distillation loss, CAKD prioritizes high-confidence samples, improving both performance and efficiency. Our experiments on text summarization (using a BART-based model on the CNN/DM dataset) and chatbot tasks (using a Llama-based model on the DailyDialog and PersonaChat datasets) demonstrate that CAKD achieves significant performance gains over its larger teacher models, with improvements of 10.53, 2.1, and 0.38 ROUGE-L points, respectively.
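For illustration, below is a minimal PyTorch sketch of how entropy-based confidence weighting can be folded into a distillation loss, in the spirit of the CAKD objective described in the abstract. This is not the authors' released implementation: the function name, the temperature T, and the choice of one-minus-normalized-entropy as the confidence score are assumptions made for the example.

    # Minimal sketch of entropy-based confidence weighting in a distillation loss.
    # Illustrative reconstruction only; the weighting function and hyperparameters
    # (e.g., temperature T) are assumptions, not the paper's exact formulation.
    import torch
    import torch.nn.functional as F

    def confidence_weighted_kd_loss(student_logits, teacher_logits, T=2.0, eps=1e-8):
        """Weight the per-sample KD loss by the teacher's confidence.

        Confidence is taken as 1 minus the normalized entropy of the teacher's
        predictive distribution, so low-entropy (confident) teacher predictions
        contribute more to the student's training signal.
        """
        teacher_probs = F.softmax(teacher_logits / T, dim=-1)      # (batch, vocab)
        log_teacher = torch.log(teacher_probs + eps)

        # Normalized entropy in [0, 1]; lower entropy => higher confidence.
        entropy = -(teacher_probs * log_teacher).sum(dim=-1)
        max_entropy = torch.log(torch.tensor(float(teacher_logits.size(-1))))
        confidence = 1.0 - entropy / max_entropy                   # (batch,)

        # Standard KL distillation term per sample (summed over the vocabulary).
        log_student = F.log_softmax(student_logits / T, dim=-1)
        kl = F.kl_div(log_student, teacher_probs, reduction="none").sum(dim=-1)

        # Emphasize high-confidence teacher predictions.
        return (confidence * kl).mean() * (T ** 2)

In a sequence-to-sequence setting such as summarization or dialogue, the same function applies per token when the logits have shape (batch, seq_len, vocab_size).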
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Knowledge Distillation, Chatbot, Summarization
Languages Studied: English
Submission Number: 4425