Keywords: LLMs, Weight Initialization, Distillation
Abstract: Algorithmic efficiency techniques such as distillation (\cite{hinton2015distillation}) are useful for improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods are limited to forcing the student to match only the teacher's outputs. Given the cost of training a large teacher model, we believe we should extract more useful information from it than its outputs alone.
In this paper, we introduce \guide (Guided Initialization and Distillation of Embeddings). \guide can be considered an initialization technique (a special distillation technique) that forces the student to match the teacher in parameter space. Using \guide, we show a 25-26\% reduction in the teacher-student quality gap for large student models (400M-1B parameters) trained on $\approx$ 20B tokens. We also present a detailed analysis showing that \guide can be combined with knowledge distillation, yielding near-additive improvements. Furthermore, we show that applying \guide alone leads to substantially better model quality than applying knowledge distillation by itself.
Most importantly, \guide introduces no training or inference overhead, and hence any model quality gains from our method are virtually free.
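To make the idea of matching the teacher in parameter space concrete, below is a minimal sketch (not the paper's actual method) of initializing a narrower student embedding table directly from a teacher's weights. The function name and the prefix-slice mapping are assumptions for illustration only; the paper may use a different projection or selection of teacher parameters.

```python
import numpy as np

def guide_init_embedding(teacher_emb: np.ndarray, student_dim: int) -> np.ndarray:
    """Sketch of parameter-space initialization of a student embedding table.

    The student's (vocab, student_dim) table is taken as the leading slice of
    the teacher's (vocab, teacher_dim) table, so the student starts training
    already aligned with the teacher's embeddings. This is an illustrative
    assumption, not necessarily the mapping used in the paper.
    """
    vocab_size, teacher_dim = teacher_emb.shape
    assert student_dim <= teacher_dim, "student must be no wider than the teacher"
    return teacher_emb[:, :student_dim].copy()

# Usage example: a 1024-dim teacher embedding initializes a 512-dim student.
rng = np.random.default_rng(0)
teacher_embedding = rng.normal(size=(32_000, 1024)).astype(np.float32)
student_embedding = guide_init_embedding(teacher_embedding, student_dim=512)
print(student_embedding.shape)  # (32000, 512)
```

Because the copy happens once before training, a scheme like this adds no cost at training or serving time, which is consistent with the overhead claim above.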
Submission Number: 98