Knowledge Distillation through Representational Alignment

ACL ARR 2024 August Submission 307 Authors

16 Aug 2024 (modified: 20 Sept 2024) · License: CC BY 4.0
Abstract: Knowledge distillation is a common paradigm for transferring capabilities from a larger model to a smaller one. Assuming white-box access to the larger model, traditional knowledge distillation methods typically define a probability distribution over the activations and minimize a divergence between the larger and smaller models. These methods are usually limited to last-layer activations and do not exploit the information contained in hidden-layer representations. In this work, we propose a distillation method that explicitly utilizes popular measures of representational alignment: CKA and Shape. We show that our method yields statistically significant improvements (up to 2 percentage points, $p<0.05$) over both fine-tuning and standard logits-based distillation on three tasks (CoLA, RTE, and MRPC) of the GLUE benchmark.
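
The abstract describes augmenting standard logit-based distillation with an objective built on representational-alignment measures such as CKA. The sketch below is a minimal illustration, not the authors' implementation, of how a linear-CKA alignment term could be added to a conventional distillation loss; the function name `linear_cka`, the loss weights `kd_weight` and `cka_weight`, the temperature, and the use of a single hidden layer are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between feature matrices x (n, d1) and y (n, d2)."""
    x = x - x.mean(dim=0, keepdim=True)  # center each feature dimension
    y = y - y.mean(dim=0, keepdim=True)
    # CKA(X, Y) = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = (y.t() @ x).norm() ** 2  # default matrix norm is Frobenius
    denominator = (x.t() @ x).norm() * (y.t() @ y).norm()
    return numerator / denominator


def distillation_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden, labels,
                      temperature=2.0, kd_weight=0.5, cka_weight=0.1):
    """Task loss + softened-logit KL (standard KD) + a (1 - CKA) alignment penalty."""
    task = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Encourage the student's hidden representations to align with the teacher's;
    # CKA tolerates the (possibly different) hidden sizes of the two models.
    alignment = 1.0 - linear_cka(student_hidden, teacher_hidden)
    return task + kd_weight * kd + cka_weight * alignment
```

A Shape-based alignment term could in principle be slotted in the same way, by replacing `linear_cka` with a shape-metric similarity between the two sets of hidden representations.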
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: distillation, representation learning
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings-efficiency
Languages Studied: English
Submission Number: 307