Keywords: reinforcement learning, post-training, knowledge distillation
TL;DR: We present CoDistill-GRPO, a variant of GRPO that aims to increase the gains GRPO delivers for small language models while improving training efficiency for larger ones.
Abstract: Group Relative Policy Optimization (GRPO) has emerged as a powerful algorithm for improving reasoning in language models, but it often fails to improve small models because rewards are sparse on difficult tasks. Existing work attempts to mitigate this issue by using a larger language model as a frozen oracle, either to provide hints for rollouts or for knowledge distillation (KD). However, this approach assumes that such an oracle exists, and training one can significantly increase total training time. In this work, we propose CoDistill-GRPO, a co-distillation algorithm that simultaneously trains a large and a small model by maximizing the GRPO objective. The two models learn from each other: the small model uses an on-policy KD reward to learn from the large model's distribution, while the large model is updated using rollouts generated by the small model, which reduces the computational overhead of rollout generation. We show that CoDistill-GRPO substantially improves small-model performance over GRPO on mathematical benchmarks across both Qwen and Llama models, with an accuracy increase of 6.0 percentage points on the Minerva dataset for Qwen2.5-Math-1.5B. Interestingly, we also show that the large model trained with CoDistill-GRPO can nearly match standard GRPO performance despite training on small-model rollouts. This highlights CoDistill-GRPO as a potentially cost-effective alternative to GRPO for larger models.
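To make the sparse-reward issue concrete, the following is a minimal sketch and not the authors' implementation. It uses the standard GRPO group normalization of rewards. The abstract does not specify the KD reward, so `kd_reward` (a mean per-token log-ratio on tokens sampled by the small model), `shaped_rewards`, and the weight `beta` are hypothetical choices made only for illustration.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of rollouts for the same prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def kd_reward(teacher_logprobs, student_logprobs):
    """ASSUMED form of the KD reward (not given in the abstract).

    Mean over sampled tokens of log p_large(token) - log p_small(token),
    where the tokens were sampled by the small model.
    """
    assert len(teacher_logprobs) == len(student_logprobs)
    n = len(teacher_logprobs)
    return sum(t - s for t, s in zip(teacher_logprobs, student_logprobs)) / n


def shaped_rewards(task_rewards, kd_rewards, beta=0.1):
    """Hypothetical combination: task reward plus a weighted KD reward."""
    return [r + beta * k for r, k in zip(task_rewards, kd_rewards)]


# Every rollout fails the task, so plain GRPO advantages are all zero.
task = [0.0, 0.0, 0.0, 0.0]
plain = group_relative_advantages(task)

# With a KD term, rollouts the large model finds more likely get a
# higher reward, so the group no longer has identical rewards.
kd = [-0.5, -1.0, -0.2, -0.8]  # made-up values for illustration
shaped = group_relative_advantages(shaped_rewards(task, kd))
```

In this toy group, `plain` carries no learning signal because all four rewards are equal, whereas `shaped` ranks the rollouts by how well they agree with the large model. Whether CoDistill-GRPO combines the two terms this way is an assumption here.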
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 3