Improving Language Model Distillation through Hidden State Matching

Sayantan Dasgupta; Trevor Cohn

Published: 22 Jan 2025, Last Modified: 30 Apr 2025. Venue: ICLR 2025 Poster. License: CC BY 4.0

Keywords: Knowledge Distillation, Centered Kernel Alignment, BART, mBART, T5

Abstract: Hidden state matching has been shown to improve knowledge distillation of language models by encouraging similarity between the hidden states of a student and those of its teacher, as demonstrated by DistilBERT and its successors. It typically uses a cosine loss, which forces the student's hidden dimensionality to equal the teacher's and so severely limits the compression ratio. We present an alternative technique using Centered Kernel Alignment (CKA) to match hidden states of different dimensionality, allowing for smaller students and higher compression ratios. We show the efficacy of our method using encoder-decoder (BART, mBART and T5) and encoder-only (BERT) architectures across a range of tasks from classification to summarization and translation. Our technique is competitive with current state-of-the-art distillation methods at comparable compression rates. It requires no pretrained student models; instead, it can synthesize new student models from scratch through pretraining distillation. It scales to smaller students than current methods allow, is no slower in training or inference, and is considerably more flexible.
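To illustrate why CKA can compare hidden states of different widths, here is a minimal NumPy sketch of the standard linear CKA similarity. It is an assumption-laden illustration, not the authors' implementation: the abstract does not say which CKA variant or loss form the paper uses, and the names `linear_cka` and `cka_loss`, the loss form `1 - CKA`, and the batch and width sizes are all hypothetical.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two sets of hidden states.

    x has shape (n, d_x) and y has shape (n, d_y): the same n tokens,
    but the feature widths d_x and d_y may differ.
    """
    # Center every feature column over the n examples.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the (d_y, d_x) cross-covariance.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Frobenius norms of the two self-covariances normalize the score.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Hypothetical matching loss: 0 when the representations align fully."""
    return 1.0 - linear_cka(student, teacher)


# Toy hidden states: 32 tokens, teacher width 48, student width 16 (made up).
rng = np.random.default_rng(0)
teacher_states = rng.normal(size=(32, 48))
student_states = rng.normal(size=(32, 16))
print(cka_loss(student_states, teacher_states))
```

Because the score depends only on the two n-by-n similarity structures, no projection layer is needed to bring the student's width up to the teacher's, which is the restriction a cosine loss imposes.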

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 1863
