Keywords: Reinforcement Learning, Knowledge Distillation, Text Generation
TL;DR: We propose a novel block-wise return induction method for RL-based knowledge distillation, which mitigates the high-variance issue in RL and stabilizes the training process.
Abstract: Reinforcement Learning (RL)-based knowledge distillation (KD) is increasingly used to train language models for text generation. However, existing methods suffer from high variance caused by long action chains during sampling. To address this, we propose a novel block-wise return induction approach (called BRIM) that mitigates the high variance issue and stabilizes the training process.
Our idea is to apply the Bellman Optimality Equation in reverse to each $K$-step block segmented from the student's explored trajectories, thereby inducing a return for every block from the teacher model, which serves as the policy-gradient training signal.
Theoretical analysis shows that our BRIM reduces the variance of the gradient estimates, thus leading to improved RL optimization, especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation, which suggests that our BRIM offers a promising direction for enhancing RL-based KD in LLM research.
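The abstract describes inducing per-block returns from teacher-derived block rewards via a backward Bellman-style recursion and using them as policy-gradient signals. The snippet below is a minimal, hypothetical sketch of that idea only; the names `teacher_reward`, `K`, and `gamma` are illustrative assumptions, and the actual BRIM procedure in the paper may differ.

```python
# Hypothetical sketch of block-wise return induction for RL-based KD.
# Not the authors' implementation; all names and defaults are assumed.

from typing import Callable, List


def blockwise_returns(
    tokens: List[int],
    teacher_reward: Callable[[List[int]], float],  # assumed teacher scoring function
    K: int = 4,          # block length (K-step blocks)
    gamma: float = 1.0,  # discount factor
) -> List[float]:
    """Segment a sampled trajectory into K-step blocks and induce a return
    for each block by a backward Bellman-style recursion:
        G_b = r_b + gamma * G_{b+1},
    where r_b is the teacher-derived reward for block b."""
    blocks = [tokens[i:i + K] for i in range(0, len(tokens), K)]
    block_rewards = [teacher_reward(block) for block in blocks]

    returns = [0.0] * len(blocks)
    running = 0.0
    for b in reversed(range(len(blocks))):
        running = block_rewards[b] + gamma * running
        returns[b] = running
    return returns


if __name__ == "__main__":
    # Toy usage with a dummy teacher reward (block length as the score).
    demo = blockwise_returns(list(range(10)), teacher_reward=lambda b: float(len(b)), K=4)
    print(demo)  # one induced return per 4-token block, used as a PG signal
```

In a policy-gradient update, each induced block return would weight the log-probabilities of the tokens in its block, replacing a single trajectory-level return and thereby reducing the variance of the gradient estimate.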
Primary Area: reinforcement learning
Submission Number: 4591