<div align="center">

# Reinforcement Learning via Self-Distillation (SDPO)

</div>

Our implementation builds on top of a recent version of [verl](https://github.com/verl-project/verl):

```bibtex
@article{sheng2024hybridflow,
  title   = {HybridFlow: A Flexible and Efficient RLHF Framework},
  author  = {Guangming Sheng and Chi Zhang and Zilingfeng Ye and Xibin Wu and Wang Zhang and Ru Zhang and Yanghua Peng and Haibin Lin and Chuan Wu},
  year    = {2024},
  journal = {arXiv preprint arXiv: 2409.19256}
}
```

## 📖 Introduction

Large language models are increasingly post-trained with reinforcement learning in verifiable domains such as code and math. Yet, current methods for reinforcement learning with verifiable rewards (RLVR) learn only from a scalar outcome reward per attempt, creating a severe credit-assignment bottleneck. Many verifiable environments actually provide rich textual feedback, such as runtime errors or judge evaluations, that explain *why* an attempt failed. We formalize this setting as *Reinforcement Learning with Rich Feedback* (RLRF):

<p align="center">
<img src="figures/sdpo-fig-training-loop.png" alt="Reinforcement Learning with Rich Feedback" width="60%">
</p>

**We propose Self-Distilled Policy Optimization (SDPO)**, a reinforcement learning framework that augments on-policy optimization with self-distillation from the model’s own high-reward trajectories.

SDPO converts tokenized feedback into a dense learning signal without any external teacher or explicit reward model. SDPO treats the current model conditioned on feedback as a self-teacher and distills its feedback-informed next-token predictions back into the policy. In this way, SDPO leverages the model's ability to retrospectively identify its own mistakes in-context.

<p align="center">
<img src="figures/sdpo-fig.png" alt="SDPO" width="60%">
</p>
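As a rough illustration of the distillation step described above, the sketch below computes a per-token divergence between the feedback-conditioned self-teacher and the policy. It is a minimal pure-Python sketch, not the implementation in this repository: the function names, the choice of KL(teacher || student), and the plain mean over tokens are all assumptions made for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def self_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over the response tokens.

    student_logits[t]: policy logits at response position t (no feedback in context).
    teacher_logits[t]: logits of the same model at position t with the feedback
    in context. Assumption: the teacher is treated as a fixed target.
    """
    per_token = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_p_s = log_softmax(s_row)
        log_p_t = log_softmax(t_row)
        # KL(p_t || p_s) = sum_v p_t(v) * (log p_t(v) - log p_s(v))
        kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
        per_token.append(kl)
    return sum(per_token) / len(per_token), per_token


# Toy example: two response positions, vocabulary of size 3.
student = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
teacher = [[0.0, 1.0, 2.0], [3.0, 0.0, 0.0]]  # feedback changes position 1 only
loss, per_token = self_distillation_loss(student, teacher)
```

In this toy example only the second position contributes to the loss, which is the sense in which the signal is dense: every token receives its own target rather than sharing one scalar reward.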

---

## 📊 Main Results

### Learning without Rich Environment Feedback

SDPO turns successful rollouts into reusable supervision, allowing the policy to learn directly from its own best generations without requiring external demonstrations or additional datasets.

When environment feedback is sparse or rule-based, standard reinforcement learning methods struggle to propagate learning signals efficiently. SDPO addresses this by reusing high-reward rollouts as implicit demonstrations, providing dense supervision even in the absence of rich feedback.
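One way to picture "reusing high-reward rollouts as implicit demonstrations" is to build the self-teacher's context from a successful sibling rollout in the same group. The sketch below is hypothetical: the function name, the prompt template, the success threshold, and the rule of never showing a rollout to itself are assumptions for illustration, not the format used in this repository.

```python
def build_teacher_prompts(question, rollouts, rewards, success_threshold=1.0):
    """Return one teacher context per rollout, or None if no demonstration exists.

    Assumption: a rollout counts as a demonstration when its reward reaches
    `success_threshold`, and a rollout is never used as its own demonstration.
    """
    successes = [i for i, r in enumerate(rewards) if r >= success_threshold]
    prompts = []
    for i in range(len(rollouts)):
        demos = [j for j in successes if j != i]
        if not demos:
            # No other successful rollout in the group: no self-teacher signal.
            prompts.append(None)
            continue
        demo = rollouts[demos[0]]
        # Hypothetical template; the real wording may differ.
        prompts.append(
            f"{question}\n\nA correct solution:\n{demo}\n\nNow answer the question."
        )
    return prompts


prompts = build_teacher_prompts(
    question="What is 2 + 2?",
    rollouts=["4", "5", "22"],
    rewards=[1.0, 0.0, 0.0],
)
```

Here the two failed rollouts receive a teacher context containing the successful answer, while the successful rollout has no other demonstration to draw on and gets `None`.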

<p align="center">
<img src="figures/chemistry-accuracy-response.png" alt="SDPO Performance vs. Training Steps" width="60%">
</p>

*Training progression of Olmo3-7B-Instruct on Chemistry. We report the average accuracy across 16 samples per question and a rolling average of response lengths over 5 steps. We report GRPO with the optimal hyperparameters for this model and task.*
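For readers reproducing the curves, the two reported quantities can be computed roughly as follows. This is a minimal sketch: the function names are illustrative, and the trailing window that shortens at the start of training is an assumption about how the rolling average is handled.

```python
def avg_at_k(correct_per_question):
    """Average accuracy over k samples per question, then over questions.

    correct_per_question: list of lists of 0/1 flags, one inner list per question.
    """
    per_question = [sum(flags) / len(flags) for flags in correct_per_question]
    return sum(per_question) / len(per_question)


def rolling_mean(values, window=5):
    """Trailing rolling mean; early steps average over the steps seen so far."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


# Two questions with 4 samples each (the plots use 16 samples per question).
accuracy = avg_at_k([[1, 0, 1, 1], [0, 0, 0, 1]])
smoothed_lengths = rolling_mean([10, 20, 30], window=2)
```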

<p align="center">
<img src="figures/table-no-rich-feedback.png" alt="SDPO Performance without Rich Environment Feedback" width="60%">
</p>

***Comparison of SDPO and GRPO on reasoning-related benchmarks.** We report the highest avg@16 achieved within 1 hour and 5 hours of wall-clock training time, respectively.
Both SDPO and on-policy GRPO perform one gradient step per generation batch, while off-policy GRPO performs 4 mini-batch steps per generation batch. We select optimal hyperparameters for SDPO and the baselines based on 5h accuracy. Each run is performed on a node with 4 NVIDIA GH200 GPUs. Together with initialization and validation, each run takes approximately 6 hours.*

---

### Learning with Rich Environment Feedback

In settings where environments provide structured or textual feedback, SDPO naturally incorporates this information into self-distillation. By conditioning future attempts on both successful demonstrations and feedback from failed attempts, SDPO achieves faster convergence and more stable training.

<p align="center">
<img src="figures/lcbv6-accuracy.png" alt="SDPO Performance with Rich Environment Feedback" width="60%">
</p>

***SDPO with rich environment feedback.**
Left: SDPO benefits from denser credit assignment (logit > token > sequence-level) and consistently outperforms GRPO when rich feedback is available.
Right: The self-teacher improves throughout training, and the final student substantially surpasses the initial teacher. Error bars show variability across seeds.*
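To make the three granularities in the caption concrete, the sketch below shows one possible reading of what each level exposes to the learner. It is an illustration only, under the assumption that the signal is a teacher-minus-student log-probability difference; the function name and this definition are hypothetical and may not match the variants evaluated in the figure.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def credit_signals(student_logits, teacher_logits, sampled_tokens):
    """Return (sequence_level, token_level, logit_level) signals.

    sequence_level: one scalar shared by the whole response.
    token_level:    one scalar per position, for the sampled token only.
    logit_level:    one value per position and vocabulary entry.
    """
    token_level, logit_level = [], []
    for s_row, t_row, y in zip(student_logits, teacher_logits, sampled_tokens):
        log_p_s, log_p_t = log_softmax(s_row), log_softmax(t_row)
        token_level.append(log_p_t[y] - log_p_s[y])
        logit_level.append([lt - ls for lt, ls in zip(log_p_t, log_p_s)])
    sequence_level = sum(token_level)
    return sequence_level, token_level, logit_level


# Two positions, vocabulary of size 2; the teacher disagrees at position 1 only.
seq, tok, logit = credit_signals(
    student_logits=[[0.0, 0.0], [0.0, 0.0]],
    teacher_logits=[[0.0, 0.0], [2.0, 0.0]],
    sampled_tokens=[0, 1],
)
```

In this toy case the sequence-level scalar cannot say which position was at fault, the token-level signal localizes it to position 1, and the logit-level signal additionally indicates which alternative token the teacher prefers there.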

---

### Solving Hard Questions via Test-Time Self-Distillation

SDPO also enables **test-time self-distillation**. By generating multiple candidate solutions, identifying high-quality responses, and reusing them as demonstrations, the model can iteratively refine its outputs at inference time. This leads to substantial gains on hard reasoning tasks without additional training.

<p align="center">
<img src="figures/very-hard-questions.png" alt="Test-Time Self-Distillation" width="60%">
</p>

***Test-time self-distillation on hard coding problems.**
SDPO solves questions that neither the base model nor multi-turn interaction can solve, achieving higher solution discovery rates across generation budgets.*

