GAKD: Generative Adversarial Knowledge Distillation For Large Language Models

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: knowledge distillation, LLM
TL;DR: This paper introduces GAKD, a novel adversarial knowledge distillation method that leverages adversarial training and reverse KL divergence to enhance student model alignment with large language model teachers.
Abstract: Current white-box knowledge distillation (KD) methods for large language models (LLMs) often rely on distribution distance metrics, such as forward or reverse Kullback–Leibler Divergence (KLD), as optimization objectives. However, the KLD objective provides only token-wise feedback during knowledge distillation; it lacks long-range, sequence-level signals, leading to poor distribution alignment between the teacher and student models. To address this, we propose the Generative Adversarial Knowledge Distillation (GAKD) framework, which adopts a minimax adversarial strategy. Specifically, GAKD trains: (1) a generator (student) to align with the teacher's distribution via a combination of sequence-level adversarial loss and reverse KLD loss, and (2) a discriminator to distinguish whether per-token logits come from the teacher or the student. By jointly minimizing the token-level reverse KLD and sequence-level adversarial losses, GAKD enables the student model to align more effectively with the teacher's distribution, leading to improved performance. Furthermore, we provide a mathematical proof of the feasibility of optimizing the reverse KLD loss on teacher-generated sequences, establishing the theoretical soundness of GAKD. Experimental results on instruction-following tasks, conducted on the Qwen-3 model families (with parameters ranging from 0.6B to 8B), demonstrate that, by utilizing sequence-level signals, GAKD generates more accurate responses than state-of-the-art (SOTA) baselines, especially in long-text generation scenarios. Our code can be found at https://anonymous.4open.science/r/GAKD-8753/.
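The abstract's combined objective, token-level reverse KLD plus a sequence-level adversarial term, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the per-token probability inputs, the `alpha` weight, the non-saturating generator loss, and the scalar discriminator score `disc_score` are all assumptions made for clarity.

```python
import math

def reverse_kld(student_probs, teacher_probs):
    """Reverse KLD for one token position: sum_x q(x) * log(q(x) / p(x)),
    where q is the student distribution and p is the teacher distribution."""
    return sum(q * math.log(q / p)
               for q, p in zip(student_probs, teacher_probs) if q > 0)

def gakd_student_loss(student_seq, teacher_seq, disc_score, alpha=1.0):
    """Illustrative student objective: token-level reverse KLD averaged over
    the sequence, plus a sequence-level adversarial term.

    student_seq / teacher_seq: lists of per-token probability distributions.
    disc_score: assumed discriminator probability (in (0, 1]) that the
    student's sequence came from the teacher; alpha is a hypothetical weight.
    """
    kld = sum(reverse_kld(q, p)
              for q, p in zip(student_seq, teacher_seq)) / len(student_seq)
    # Non-saturating generator loss: the student is rewarded when the
    # discriminator mistakes its output for the teacher's.
    adv = -math.log(disc_score)
    return kld + alpha * adv
```

When the student already matches the teacher and fully fools the discriminator (`disc_score = 1.0`), both terms vanish; a lower `disc_score` or a distribution mismatch raises the loss, which is the alignment pressure the minimax game provides.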
Supplementary Material: zip
Primary Area: generative models
Submission Number: 10374