Calibrated Offline Knowledge Distillation for Large Language Model Training

Anonymous

16 Feb 2024 | ACL ARR 2024 February Blind Submission | Readers: Everyone
Abstract: Knowledge Distillation (KD) in large language models (LLMs), which involves training a small model to mimic the behaviour of a large model by matching their output distributions, has shown remarkable improvements in performance and efficiency over standard fine-tuning. Despite the great success of these methods, distilled student models still suffer from catastrophic mis-calibration due to the over-confident nature of the teacher model. In this paper, we present a comprehensive study on the importance and necessity of re-calibration during soft-label-based distillation. We further propose a soft-label-based Calibrated Offline Knowledge Distillation (COD) pipeline that can effectively determine to what extent different token probabilities should be reduced or raised, resulting in consistent distillation of a reliable model. Specifically, we start by re-calibrating the token probability distribution generated by the teacher model, reducing the probability of over-confident tokens and raising that of under-confident ones. We then train a student model to fit the calibrated distribution. We conduct extensive experiments in both in-domain and out-of-domain settings, comparing re-calibrated distillation with non-calibrated distillation and standard fine-tuning across three popular open-source language model families (Llama-1, Llama-2, and OpenLlama). Experimental results demonstrate that re-calibration before distillation greatly improves the reliability of the model (by 4.3\% expected calibration error on average) and generally further boosts downstream performance (by 2.5\% accuracy on average).
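The abstract's two-stage pipeline (re-calibrate the teacher's soft labels offline, then fit the student to the calibrated distribution) can be sketched roughly as follows. This is an illustrative sketch only: the use of temperature scaling as the calibration step and a KL-divergence distillation loss are assumptions, not the authors' confirmed method, and names such as `teacher_logits` and `student_logits` are hypothetical.

```python
import torch
import torch.nn.functional as F

def calibrate_teacher_probs(teacher_logits: torch.Tensor, temperature: float = 2.0) -> torch.Tensor:
    """Re-calibrate the teacher's token distribution (assumed: temperature scaling).

    A temperature > 1 flattens the distribution: probabilities of over-confident
    tokens are reduced and probabilities of under-confident tokens are raised.
    """
    return F.softmax(teacher_logits / temperature, dim=-1)

def calibrated_distillation_loss(student_logits: torch.Tensor,
                                 calibrated_probs: torch.Tensor) -> torch.Tensor:
    """Train the student to fit the calibrated soft labels via KL divergence (assumed loss)."""
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    return F.kl_div(student_log_probs, calibrated_probs, reduction="batchmean")

# Toy usage: batch of 2 sequences, length 4, vocabulary of 8 tokens.
teacher_logits = torch.randn(2, 4, 8)
student_logits = torch.randn(2, 4, 8, requires_grad=True)
soft_labels = calibrate_teacher_probs(teacher_logits)            # offline, computed once
loss = calibrated_distillation_loss(student_logits, soft_labels)  # student update step
loss.backward()
```

Because the calibration happens offline, the soft labels can be precomputed and stored once, so the student's training loop never needs the teacher in memory.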
Paper Type: long
Research Area: NLP Applications
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches to low compute settings (efficiency)
Languages Studied: English