Multi-Teacher Knowledge Distillation Augmented Group Relative Policy Optimization

TMLR Paper 5347 Authors

09 Jul 2025 (modified: 16 Jul 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transfer learning, a key paradigm for reusing pre-existing knowledge, can significantly enhance reinforcement learning agents, particularly in settings involving Large Language Models (LLMs) and Small Language Models (SLMs). Knowledge Distillation (KD) provides a potent mechanism for transferring knowledge from expert LLM teachers to SLM students. Group Relative Policy Optimization (GRPO) is a robust, critic-free reinforcement learning algorithm that estimates advantages via intra-group reward comparisons. Standard GRPO, however, does not inherently incorporate guidance from external expert policies and can exhibit training instability. This paper introduces a novel theoretical framework that integrates multi-teacher KD with GRPO. We propose a family of GRPO-KD objective functions; our primary formulation augments GRPO with an explicit, adaptively weighted multi-teacher distillation term to stabilize SLM training. We further explore two advanced strategies: one modifies the Kullback-Leibler (KL) regularization of GRPO, and the other introduces a Teacher Agreement Score that directly modulates the advantage calculation for deeper guidance from multiple LLM teachers. Experimental results on benchmark reasoning tasks demonstrate that the proposed framework not only stabilizes training but also significantly outperforms standard GRPO and other baselines, validating the effectiveness of combining critic-free RL with multi-teacher guidance.
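
For intuition only, the sketch below shows one plausible form of a GRPO-style loss augmented with a weighted multi-teacher distillation term, as the abstract describes. It is not the submission's exact objective: all names (grpo_kd_loss, distill_weights, beta, etc.), the uniform default teacher weighting, and the omission of the usual PPO-style clipping are illustrative assumptions; the paper defines the actual formulation, including the adaptive weights and the Teacher Agreement Score.

    # Hypothetical sketch: GRPO-style policy loss plus a multi-teacher KD term.
    # Not the paper's exact objective; names and weighting are illustrative.
    import torch
    import torch.nn.functional as F

    def grpo_kd_loss(student_logits, teacher_logits_list, rewards,
                     old_logprobs, actions, distill_weights=None, beta=0.1):
        """student_logits:      (G, T, V) student logits for G sampled responses
                                of length T over vocabulary V.
        teacher_logits_list: list of (G, T, V) logits, one per LLM teacher.
        rewards:             (G,) scalar reward per sampled response.
        old_logprobs:        (G, T) per-token log-probs under the sampling policy.
        actions:             (G, T) sampled token ids.
        distill_weights:     optional per-teacher weights; uniform if None.
        beta:                coefficient trading off the RL and KD terms.
        """
        # Group-relative advantage: normalize rewards within the sampled group.
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # (G,)

        # Importance-weighted policy-gradient term (clipping omitted for brevity).
        logprobs = torch.log_softmax(student_logits, dim=-1)
        token_logprobs = logprobs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # (G, T)
        ratio = torch.exp(token_logprobs - old_logprobs)
        pg_loss = -(ratio * advantages.unsqueeze(-1)).mean()

        # Multi-teacher distillation: weighted KL(teacher || student) per teacher.
        if distill_weights is None:
            n = len(teacher_logits_list)
            distill_weights = torch.full((n,), 1.0 / n)
        kd_loss = 0.0
        for w, t_logits in zip(distill_weights, teacher_logits_list):
            t_logprobs = torch.log_softmax(t_logits, dim=-1)
            kd_loss = kd_loss + w * F.kl_div(logprobs, t_logprobs,
                                             log_target=True, reduction="batchmean")

        return pg_loss + beta * kd_loss

Under this reading, the adaptive weighting described in the abstract would determine distill_weights, and a Teacher Agreement Score could instead rescale the advantages; consult the paper for the actual definitions.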
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Nino_Vieillard1
Submission Number: 5347