Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs

TMLR Paper 3045 Authors

21 Jul 2024 (modified: 28 Nov 2024) · Decision pending for TMLR · CC BY 4.0
Abstract: Deploying large language models (LLMs) with billions of parameters is often impractical in industrial settings due to constraints like cost, latency, and hardware limitations. Knowledge distillation (KD) provides a solution by compressing the knowledge from large, resource-intensive models into task-specific smaller ones. Various strategies exist, some relying on the text generated by the teacher model and, optionally, leveraging its output logits to improve learning. However, these logit-based methods usually require the teacher and student models to share the same tokenizer, which limits their applicability across different model families. In this paper, we propose the Universal Logit Distillation (ULD) loss, which uses optimal transport theory to enable distillation across different architectures and tokenizers. Our results demonstrate that ULD loss effectively facilitates the distillation process, paving the way for more widespread use of distillation.
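
The sketch below is not part of the submission page; it is a minimal illustration of how a cross-tokenizer logit distillation loss of this kind could be written, assuming it is approximated as the closed-form 1-D Wasserstein distance between the sorted output probability distributions of teacher and student. The function name `uld_loss`, the zero-padding of the smaller vocabulary, and the position-wise alignment assumption are illustrative choices, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F


def uld_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """Cross-tokenizer logit distillation sketch.

    student_logits: (batch, seq_len, student_vocab)
    teacher_logits: (batch, seq_len, teacher_vocab)

    Vocabularies may differ in size and ordering, so token identities cannot
    be matched directly. Instead, the sorted probability distributions are
    compared; their L1 distance is the closed-form 1-D Wasserstein distance
    between the two discrete distributions. Assumes teacher and student
    logits are already aligned position-wise.
    """
    s_probs = F.softmax(student_logits, dim=-1)
    t_probs = F.softmax(teacher_logits, dim=-1)

    # Pad the smaller vocabulary with zero-probability entries so both
    # distributions share the same support size.
    vocab_gap = s_probs.size(-1) - t_probs.size(-1)
    if vocab_gap > 0:
        t_probs = F.pad(t_probs, (0, vocab_gap))
    elif vocab_gap < 0:
        s_probs = F.pad(s_probs, (0, -vocab_gap))

    # Sort each distribution in descending order of probability mass.
    s_sorted, _ = torch.sort(s_probs, dim=-1, descending=True)
    t_sorted, _ = torch.sort(t_probs, dim=-1, descending=True)

    # L1 distance between sorted probabilities, averaged over positions.
    return (s_sorted - t_sorted).abs().sum(dim=-1).mean()
```

In practice, a term like this would typically be added to the standard cross-entropy objective of the student, weighted by a tunable coefficient.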
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Frederic_Sala1
Submission Number: 3045