Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

Published: 13 Jan 2025, Last Modified: 13 Jan 2025. Accepted by TMLR. License: CC BY 4.0
Abstract: Deploying large language models (LLMs) with billions of parameters is often impractical in industrial settings due to constraints such as cost, latency, and hardware limitations. Knowledge distillation (KD) provides a solution by compressing the knowledge from large, resource-intensive models into smaller, task-specific ones. Various strategies exist: some rely on text generated by the teacher model and, optionally, leverage its output logits to improve learning. However, these logit-based methods usually require the teacher and student models to share the same tokenizer, which limits their applicability across different model families. In this paper, we propose the Universal Logit Distillation (ULD) loss, which uses optimal transport theory to enable distillation across different architectures and tokenizers. Our results demonstrate that ULD loss effectively facilitates the distillation process, paving the way for more widespread use of distillation.
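The abstract does not spell out the loss itself, so the following is only an illustrative sketch of how an optimal-transport-style comparison can sidestep mismatched vocabularies. It assumes (this is not stated above) that each model's next-token distribution is sorted in decreasing order, the shorter vector is zero-padded, and the two are compared with an L1 distance, so that token identities and vocabulary sizes no longer need to match. The function names `softmax` and `sorted_prob_distance` and the toy logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_prob_distance(teacher_logits, student_logits):
    """L1 distance between descending-sorted, zero-padded probability vectors.

    Assumption for illustration: sorting discards token identity, so the two
    vocabularies may differ in both content and size.
    """
    p = sorted(softmax(teacher_logits), reverse=True)
    q = sorted(softmax(student_logits), reverse=True)
    size = max(len(p), len(q))
    # Pad the shorter distribution with zero-probability entries.
    p = p + [0.0] * (size - len(p))
    q = q + [0.0] * (size - len(q))
    return sum(abs(a - b) for a, b in zip(p, q))


# Toy logits for one generation step; the vocabularies have different sizes.
teacher = [2.0, 0.5, -1.0, 0.0, 1.0]  # hypothetical teacher vocabulary of 5 tokens
student = [1.5, 0.2, -0.3]            # hypothetical student vocabulary of 3 tokens
distance = sorted_prob_distance(teacher, student)
```

In a training loop, a term of this kind would typically be averaged over generation steps and added, with a weighting coefficient, to the usual cross-entropy on the target text; the weighting and aggregation are likewise assumptions here rather than details taken from the abstract.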
Submission Length: Regular submission (no more than 12 pages of main content)
Supplementary Material: zip
Assigned Action Editor: ~Frederic_Sala1
Submission Number: 3045