On LLM Knowledge Distillation - A Comparison between Forward KL and Reverse KL

Published: 23 Jan 2025 · Last Modified: 26 Feb 2025 · ICLR 2025 Blogpost Track · CC BY 4.0
Abstract:

In this blog post, we delve into knowledge distillation techniques for Large Language Models (LLMs), with a particular focus on using Kullback-Leibler (KL) Divergence as the optimization objective. Knowledge distillation is a powerful tool to reduce model size while maintaining comparable performance, making it especially useful in scenarios with constrained computational or serving resources. We specifically explore the nuances of Forward KL divergence and Reverse KL divergence, examining their roles in the distillation process. By comparing these two approaches, we aim to uncover their behaviours, strengths, and practical applications in LLM distillation.
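To make the distinction concrete, the following is a minimal sketch (assuming PyTorch, with per-token logits already produced by a teacher and a student model; the function name `kd_losses` and its arguments are illustrative, not part of any of the papers above) of how the forward KL objective, KL(teacher ‖ student), and the reverse KL objective, KL(student ‖ teacher), can each be computed as a distillation loss:

```python
import torch
import torch.nn.functional as F

def kd_losses(student_logits, teacher_logits, temperature=1.0):
    """Compute forward and reverse KL distillation losses for one batch of tokens.

    Both logits tensors have shape (batch, vocab_size).
    Forward KL, KL(teacher || student), averages over the teacher distribution
    (mean-seeking); reverse KL, KL(student || teacher), averages over the
    student distribution (mode-seeking).
    """
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    t_log_probs = F.log_softmax(teacher_logits / temperature, dim=-1)

    # F.kl_div(input, target) computes KL(target || input) when both are log-probs.
    # Forward KL: expectation taken under the teacher distribution.
    forward_kl = F.kl_div(s_log_probs, t_log_probs,
                          log_target=True, reduction="batchmean")

    # Reverse KL: swap the roles, expectation taken under the student distribution.
    reverse_kl = F.kl_div(t_log_probs, s_log_probs,
                          log_target=True, reduction="batchmean")

    return forward_kl, reverse_kl


# Illustrative usage with random logits and a hypothetical vocabulary size.
student_logits = torch.randn(4, 32000)
teacher_logits = torch.randn(4, 32000)
fkl, rkl = kd_losses(student_logits, teacher_logits)
print(f"forward KL: {fkl.item():.4f}, reverse KL: {rkl.item():.4f}")
```

In actual LLM distillation the two objectives are optimized with respect to the student only, so the teacher logits would be detached from the computation graph; which objective behaves better in practice is exactly the question the post compares.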

Conflict Of Interest:

Papers mentioned:

  1. Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models — Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, Ngai Wong
  2. MiniLLM: Knowledge Distillation of Large Language Models — Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
  3. GKD (On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes) — Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, Olivier Bachem

Previous collaborators: Lehigh University, Carnegie Mellon University, LinkedIn

Submission Number: 27