For Distillation, Tokens Are Not All You Need

Published: 28 Oct 2023, Last Modified: 26 Nov 2023, Instruction Workshop @ NeurIPS 2023
Keywords: Large Language Models, Llama, Instruction Tuning, Distillation
TL;DR: In this work, we introduce SLIM (Sparse Logit Infused Modeling), a simple method for distilling LLMs that leverages not only samples from the teacher LLM but also the values of the logits produced at each decoding step.
Abstract: The unwieldy size of state-of-the-art language models presents significant obstacles for deployment, driving up cost and latency. While prior works have offered methods for distilling these larger language models into smaller students, the best previous method is somewhat complex, relying on RL-based optimization. In this work, we introduce SLIM (Sparse Logit Infused Modeling), a simple method for distilling LLMs that leverages not only samples from the teacher LLM but also the values of the logits produced at each decoding step. Our distillation method uses only the top 5% of logits at each step, together with a dynamic weighting scheme that assigns weights to the KL divergence and cross-entropy losses based on the relative confidence of the student and teacher models. Our experiments demonstrate that SLIM produces models that perform better on a wide range of downstream NLP tasks than supervised fine-tuning, vanilla knowledge distillation, and the recently proposed MiniLLM. Unlike these methods, SLIM scales to much larger teachers ($\sim$70B parameters). We also provide intuition for the superior performance of SLIM via established sample complexity bounds in simplified settings.
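
The abstract names SLIM's two ingredients, sparse (top-5%) teacher logits and a confidence-based weighting of the KL and cross-entropy terms, but not their exact form. The PyTorch sketch below only illustrates that description: the function name `slim_style_loss`, the renormalization over the retained top-5% entries, and the use of each model's maximum token probability as its "confidence" (and the resulting weight `alpha`) are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative sketch of a SLIM-style distillation loss, based only on the abstract.
# The confidence proxy and weighting formula below are assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F


def slim_style_loss(student_logits, teacher_logits, target_ids, top_frac=0.05):
    """student_logits, teacher_logits: (batch, seq, vocab); target_ids: (batch, seq)."""
    vocab = teacher_logits.size(-1)
    k = max(1, int(top_frac * vocab))

    # Sparse teacher distribution: keep only the top-5% logits and renormalize over them.
    topk_vals, topk_idx = teacher_logits.topk(k, dim=-1)
    t_top_probs = F.softmax(topk_vals, dim=-1)                      # (batch, seq, k)

    # Student log-probabilities gathered at the same vocabulary indices.
    s_logprobs = F.log_softmax(student_logits, dim=-1)
    s_top_logprobs = s_logprobs.gather(-1, topk_idx)                # (batch, seq, k)

    # KL(teacher || student), restricted to the retained entries.
    kl = (t_top_probs * (t_top_probs.log() - s_top_logprobs)).sum(-1)

    # Cross-entropy against the teacher-generated tokens.
    ce = F.cross_entropy(
        student_logits.flatten(0, 1), target_ids.flatten(), reduction="none"
    ).view_as(kl)

    # Assumed confidence proxy: each model's maximum token probability at the position.
    t_conf = t_top_probs.max(dim=-1).values
    s_conf = F.softmax(student_logits, dim=-1).max(dim=-1).values
    alpha = t_conf / (t_conf + s_conf + 1e-8)   # weight the KL term more when the teacher is relatively more confident

    return (alpha * kl + (1.0 - alpha) * ce).mean()
```

Under these assumptions, keeping only the top-5% of teacher logits is what keeps the stored teacher signal small enough to ship alongside the sampled tokens, while the per-position weight decides how much to trust the teacher's distribution versus the hard targets.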
Submission Number: 101