The Student Becomes the Teacher: A Reverse Distillation Approach for Data-Efficient Knowledge Transfer Between Language Models

25 Nov 2023 (modified: 21 Feb 2024) · Submitted to SAI-AAAI2024 · CC BY 4.0
Keywords: superstilling, knowledge distillation, large-scale language models, data efficiency, sample complexity
TL;DR: This paper presents superstilling, an adaptation of Hinton et al.'s well-known distillation technique to transfer parametric knowledge between models with vastly different sizes, forward propagation methods, and weight update algorithms.
Abstract: The Transformer architecture has revolutionized the Natural Language Processing (NLP) community by providing immense gains in accuracy for several NLP tasks, especially through the creation of Large Language Models (LLMs). Transformers will not remain state-of-the-art, however. As superior architectures, especially those implemented on neuromorphic accelerators, become available, we will need cross-architecture pretraining methods that efficiently transfer knowledge from outdated machine learning models to more advanced ones. This paper presents superstilling, an adaptation of Hinton et al.'s well-known distillation technique to transfer parametric knowledge between models with vastly different sizes, forward propagation methods, and weight update algorithms. We validate this method on one of these three possibilities (transferring knowledge from a small model to a much larger one) and show that superstilling can decrease sample complexity by up to 50% during early pretraining, and by more than 10% at the knowledge saturation point.
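The abstract does not spell out the training objective, but a minimal sketch of how such a reverse transfer could be implemented is shown below, assuming a standard soft-label distillation loss in the style of Hinton et al., with the small pretrained model acting as the teacher and the much larger model under pretraining acting as the student. All names here (superstilling_step, small_teacher, large_student, temperature, alpha) are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: one pretraining step that mixes the usual next-token
# cross-entropy loss with a soft-label distillation term from a smaller,
# already-trained teacher model ("reverse" or small-to-large distillation).
import torch
import torch.nn.functional as F

def superstilling_step(large_student, small_teacher, input_ids, labels,
                       temperature=2.0, alpha=0.5):
    """Return a combined hard-label + soft-label loss for the large student."""
    with torch.no_grad():
        teacher_logits = small_teacher(input_ids)   # (batch, seq, vocab)
    student_logits = large_student(input_ids)       # (batch, seq, vocab)

    # Hard-label language-modeling loss on the ground-truth next tokens.
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), labels.view(-1))

    # Soft-label loss: KL divergence between temperature-scaled distributions,
    # rescaled by T^2 as in Hinton et al.'s original distillation objective.
    kd_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2

    # alpha trades off imitation of the small teacher against fitting the data.
    return alpha * kd_loss + (1 - alpha) * ce_loss
```

In this sketch, annealing alpha toward zero as pretraining proceeds would let the large student rely on the small teacher early (where the paper reports the biggest sample-complexity savings) and on raw data later, once the teacher's knowledge is saturated.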
Submission Number: 13