# Unlocking LLM Reasoning via Reinforcement Learning with Re-Solving

This repository contains the official implementation of the paper **"Unlocking LLM Reasoning via Reinforcement Learning with Re-Solving"**.

Below are the instructions for setting up and running the experiments described in the paper.

## Introduction

Despite the strong performance that large language models (LLMs) achieve after reinforcement learning, they still suffer from issues such as overthinking and underthinking, generating low-quality reasoning steps that degrade both efficiency and performance. Our research finds that if the initial reasoning steps are flawed, the model often cannot recover, even when it generates many more steps.

To equip models with the ability to recover from a flawed start, we introduce Reinforcement Learning with Re-solving (Re$^2$), a novel framework in which the model can flexibly choose either to produce a final answer or to re-solve the problem at any point in its reasoning process. During training, the model learns to extend partial reasoning trajectories and to decide dynamically, based on its current progress, whether to restart the reasoning process.

We reward the following two actions:

- When the reasoning trajectory is confused or leads in an incorrect direction, abandoning the current prefix and re-solving the problem.
- When the reasoning trajectory is promising, directly producing the final answer.

![](./pics/intro.png)
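The answer-or-re-solve behaviour described above can be illustrated with a small toy sketch. It is not the training code in this repository: `Step`, `solve_with_resolving`, `toy_reward`, the `attempt` callback, the `max_restarts` budget, and the 0/1 reward values are all hypothetical names and choices made for illustration, and the boolean `prefix_is_promising` stands in for whatever signal the actual method uses to judge a partial trajectory.

```python
from dataclasses import dataclass
from typing import Callable, Literal, Optional, Tuple

Action = Literal["answer", "resolve"]


@dataclass
class Step:
    """Outcome of one reasoning attempt (hypothetical structure)."""
    action: Action  # "answer" = commit to a final answer, "resolve" = restart
    text: str       # final answer, or the reasoning that was abandoned


def solve_with_resolving(
    problem: str,
    attempt: Callable[[str, int], Step],
    max_restarts: int = 4,  # assumed budget; not taken from the paper
) -> Tuple[Optional[str], int]:
    """Call `attempt` until it answers or the restart budget runs out.

    Returns (answer or None, number of restarts used).
    """
    for restarts in range(max_restarts + 1):
        step = attempt(problem, restarts)
        if step.action == "answer":
            return step.text, restarts
        # "resolve": discard the current trajectory and start over.
    return None, max_restarts


def toy_reward(prefix_is_promising: bool, action: Action) -> float:
    """Toy 0/1 reward mirroring the two rewarded actions listed above.

    Answering on a promising prefix, or re-solving on an unpromising one,
    scores 1.0; the other two combinations score 0.0 (assumed values).
    """
    return 1.0 if (action == "answer") == prefix_is_promising else 0.0
```

In this sketch, `attempt` would wrap a call to the policy model; here it can be any function that returns a `Step`, which keeps the control flow easy to test in isolation.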

## Requirements

To install the required dependencies, use the following commands:

```
conda create -n train_Re2 python=3.10
conda activate train_Re2
pip install -r requirements.txt
```

## Re$^2$ Training

To run the Re$^2$ training process, use the following commands:

```
conda activate train_Re2
cd ./verl_redo_continue
bash mytrain.sh
```

## Main Results

The results are as follows (temperature = 0.6, top-p = 0.95, num_samples = 128):

![](./pics/result.png)

