Track: tiny / short paper (up to 4 pages)
Keywords: LLMs, Task-Based Knowledge Distillation, Gradient Attribution, Knowledge Localization, Selective Distillation
TL;DR: We propose task-aware selective KD (TASKD-LLM), a novel approach that transfers only task-relevant knowledge from the teacher to the student model.
Abstract: Large language models (LLMs) have achieved state-of-the-art performance on generative tasks but are computationally expensive, making them impractical for deployment in resource-constrained environments. Knowledge distillation (KD) is a promising technique for compressing LLMs by transferring knowledge from a large teacher to a more efficient student model. However, existing task-based KD methods distill all teacher model components indiscriminately. Since teacher models are typically pre-trained for versatility across a broad range of tasks, this approach can introduce unnecessary complexity when distilling for a specific downstream task, potentially limiting the student's ability to specialize. Furthermore, previous work has shown that only a subset of an LLM's components contributes significantly to a given task, making indiscriminate distillation inefficient. Motivated by these insights, we propose task-aware selective KD (TASKD-LLM), a novel approach that transfers only task-relevant knowledge from the teacher to the student, simplifying the distillation process and preserving the student's focus on the target task. Our method is flexible and can be combined with other distillation techniques in a plug-and-play manner. Empirical results demonstrate that TASKD-LLM outperforms existing methods, achieving higher performance on several benchmark datasets.
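To make the abstract's idea of selective, task-aware distillation concrete, the following is a minimal PyTorch sketch, not the paper's implementation: it assumes a gradient-attribution score over teacher parameters identifies task-relevant components, and a distillation loss is then applied only to the selected components. All helper names (attribution_scores, select_top_k, selective_kd_loss) and the use of an MSE representation-matching loss are illustrative assumptions, not details taken from the submission.

# Hedged sketch of gradient-attribution-based selective distillation.
# Assumes `teacher` is a torch.nn.Module whose forward pass produced `logits`
# with a live autograd graph, and that per-component hidden states are
# available as name -> tensor dictionaries for both teacher and student.
import torch
import torch.nn.functional as F


def attribution_scores(teacher: torch.nn.Module,
                       logits: torch.Tensor,
                       labels: torch.Tensor) -> dict[str, float]:
    """Score each teacher parameter group by the magnitude of the task-loss
    gradient, a simple gradient-attribution proxy for task relevance."""
    teacher.zero_grad()
    task_loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
    task_loss.backward(retain_graph=True)
    scores: dict[str, float] = {}
    for name, param in teacher.named_parameters():
        if param.grad is not None:
            scores[name] = param.grad.abs().sum().item()
    return scores


def select_top_k(scores: dict[str, float], k: int) -> set[str]:
    """Keep only the k highest-scoring (most task-relevant) components."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def selective_kd_loss(student_hidden: dict[str, torch.Tensor],
                      teacher_hidden: dict[str, torch.Tensor],
                      selected: set[str]) -> torch.Tensor:
    """Match student and teacher representations only on the selected
    components (MSE chosen here purely for illustration)."""
    losses = [F.mse_loss(student_hidden[name], teacher_hidden[name].detach())
              for name in selected if name in student_hidden]
    return torch.stack(losses).sum() if losses else torch.tensor(0.0)

In use, this selective term would be added to the standard task loss during student training, which is also how it could be combined with other distillation objectives in a plug-and-play manner, as the abstract suggests.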
Submission Number: 74