Keywords: Reinforcement Learning, Knowledge Distillation, Text Generation
TL;DR: We propose a novel block-wise return induction method for RL-based knowledge distillation, which mitigates the high-variance issue in RL and stabilizes the training process.
Abstract: Reinforcement Learning (RL)-based knowledge distillation (KD) is increasingly used to train language models for text generation. However, existing methods suffer from high variance caused by long action chains during sampling. To address this, we propose a novel block-wise return induction approach (called BRIM) that mitigates the high variance issue and stabilizes the training process.
Our idea is to apply the Bellman Optimality Equation in reverse to each $K$-step block segmented from the student's explored trajectories, thereby inducing a return for every block from the teacher model, which serves as the policy-gradient training signal.
Theoretical analysis shows that our BRIM reduces the variance of the gradient estimates, thus leading to improved RL optimization, especially when the student model size is large. Empirical evaluation on three text generation tasks demonstrates that our approach yields superior performance in both standard task metrics and large language model (LLM)-based evaluation, which suggests that our BRIM offers a promising direction for enhancing RL-based KD in LLM research.
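The abstract describes inducing per-block returns from teacher-derived block rewards via a backward Bellman-style recursion and using them as policy-gradient signals. The snippet below is a minimal, hypothetical sketch of that idea only; the names `teacher_reward`, `K`, and `gamma` are illustrative assumptions, and the actual BRIM procedure in the paper may differ.

```python
# Hypothetical sketch of block-wise return induction for RL-based KD.
# Not the authors' implementation; all names and defaults are assumed.

from typing import Callable, List


def blockwise_returns(
    tokens: List[int],
    teacher_reward: Callable[[List[int]], float],  # assumed teacher scoring function
    K: int = 4,          # block length (K-step blocks)
    gamma: float = 1.0,  # discount factor
) -> List[float]:
    """Segment a sampled trajectory into K-step blocks and induce a return
    for each block by a backward Bellman-style recursion:
        G_b = r_b + gamma * G_{b+1},
    where r_b is the teacher-derived reward for block b."""
    blocks = [tokens[i:i + K] for i in range(0, len(tokens), K)]
    block_rewards = [teacher_reward(block) for block in blocks]

    returns = [0.0] * len(blocks)
    running = 0.0
    for b in reversed(range(len(blocks))):
        running = block_rewards[b] + gamma * running
        returns[b] = running
    return returns


if __name__ == "__main__":
    # Toy usage with a dummy teacher reward (block length as the score).
    demo = blockwise_returns(list(range(10)), teacher_reward=lambda b: float(len(b)), K=4)
    print(demo)  # one induced return per 4-token block, used as a PG signal
```

In a policy-gradient update, each induced block return would weight the log-probabilities of the tokens in its block, replacing a single trajectory-level return and thereby reducing the variance of the gradient estimate.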
Primary Area: reinforcement learning
Submission Number: 4591