# DeepCritic: Deliberate Critique with Large Language Models


## Installation
- For critique generation and supervised fine-tuning, please follow the instructions in the open-source platform [alignment-handbook](https://github.com/huggingface/alignment-handbook) to create the environment.

- For RL training, our code is mainly based on the open-source framework [verl](https://github.com/volcengine/verl), so please directly follow the instrucions in [verl](https://github.com/volcengine/verl) to build the environment.


## Data
Our curated 4.5K SFT data (``data/prm800k/phase2_train_final_critique.jsonl``) and used evaluation data are in the ``data/`` directory. The raw data for SFT and RL data generation can be obtained from the open-source and huaman-annotated [PRM800K](https://github.com/openai/prm800k).

## Deliberate Critique Generation
The code for generating SFT data is in the ``Critique_Generation/`` directory. We provide example commands in ``scripts/run_critique_gen.sh``, you can use the provided commands to generate the SFT data:
```bash
sh scripts/run_critique_gen.sh
```

## Supervised Fine-Tuning
Our SFT code is mainly based on the open-source platform [alignment-handbook](https://github.com/huggingface/alignment-handbook). After creating the deep critique data (or directly using our provided data), please convert it to the required format for training:
```python
python3 sft/convert_data.py
```

Then, you can perform SFT by running
```bash
sh sft/run_sft.sh
```

## Reinforcement Learning
### RL Data Generation
We provide the code for RL data generation via Monte Carlo sampling-based correctness estimation in the ``Critique_Generation/`` directory. The example commands are in ``scripts/run_rollout.sh``.
```bash
sh scripts/run_rollout.sh
```

### RL Training
Our RL training is directly based on the open-source training platform [verl](https://github.com/volcengine/verl). You can easily use the source code of [verl](https://github.com/volcengine/verl) to perform RL training.

## Evaluation
Our evaluation code is mainly adopted from the open-source benchmark [ProcessBench](https://github.com/QwenLM/ProcessBench), and you can run the following script to perform evaluation on critique models:
```bash
sh scripts/run_eval.sh
```