Keywords: Large Language Models, Llama, Instruction Tuning, Distillation
TL;DR: In this work, we introduce SLIM (Sparse Logit Infused Modeling), a simple method for distilling LLMs that leverages not only samples from the teacher LLM but also the values of the logits produced at each decoding step.
Abstract: The unwieldy size of state-of-the-art language models presents significant obstacles to deployment, driving up cost and latency. While prior works have offered methods for distilling these large language models into smaller students, the strongest previous method is comparatively complex, relying on reinforcement-learning-based optimization. In this work, we introduce SLIM (Sparse Logit Infused Modeling), a simple method for distilling LLMs that leverages not only samples from the teacher LLM but also the values of the logits produced at each decoding step. Our distillation method uses only the top 5% of logits at each step, along with a dynamic weighting scheme that assigns weights to the KL divergence and cross-entropy losses based on the relative confidence of the student and teacher models. Our experiments demonstrate that SLIM produces models that perform better on a wide range of downstream NLP tasks than supervised fine-tuning, vanilla knowledge distillation, and the recently proposed MiniLLM. Unlike these methods, SLIM scales to much larger teachers ($\sim$70B parameters). We also provide intuition for the superior performance of SLIM via established sample complexity bounds in simplified scenarios.
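The abstract names the two ingredients of the SLIM loss: a KL term restricted to the teacher's top 5% of logits, and a confidence-based dynamic weighting between that term and the cross-entropy loss. Below is a minimal PyTorch sketch of how such a loss could look. The abstract does not specify the weighting formula, so the teacher-vs-student confidence ratio used here (and the names `slim_loss`, `top_frac`, `alpha`) are hypothetical illustrations, not the authors' exact method.

```python
# Minimal sketch of a SLIM-style distillation loss (PyTorch).
# Assumptions beyond the abstract: the KL term is computed on the
# renormalized top-k support, and the dynamic weight alpha is derived
# from the teacher/student max-probability ratio (a placeholder rule).
import torch
import torch.nn.functional as F


def slim_loss(student_logits, teacher_logits, targets, top_frac=0.05):
    """student_logits, teacher_logits: (batch, vocab); targets: (batch,).
    For sequence data, flatten (batch, seq, vocab) to (batch * seq, vocab)."""
    vocab = teacher_logits.size(-1)
    k = max(1, int(top_frac * vocab))

    # Keep only the teacher's top-k (top 5%) logits per decoding step,
    # and renormalize both distributions over that sparse support.
    top_vals, top_idx = teacher_logits.topk(k, dim=-1)
    teacher_top = F.softmax(top_vals, dim=-1)
    student_top = F.log_softmax(student_logits.gather(-1, top_idx), dim=-1)

    # KL divergence on the sparse (top-k) support.
    kl = F.kl_div(student_top, teacher_top, reduction="batchmean")

    # Standard cross-entropy against the sampled / ground-truth tokens.
    ce = F.cross_entropy(student_logits, targets)

    # Hypothetical dynamic weighting: lean on the teacher (KL term) when
    # it is more confident than the student, on cross-entropy otherwise.
    with torch.no_grad():
        t_conf = F.softmax(teacher_logits, dim=-1).max(-1).values.mean()
        s_conf = F.softmax(student_logits, dim=-1).max(-1).values.mean()
        alpha = (t_conf / (t_conf + s_conf)).clamp(0.0, 1.0)

    return alpha * kl + (1 - alpha) * ce
```

Restricting the KL term to the top-k support is what makes the method "sparse": only a small slice of the teacher's vocabulary-sized logit vector needs to be stored per decoding step.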
Submission Number: 101