## 🛠️ Set up

```bash
# install dependencies for LLaMA-Factory
pip3 install -e ".[torch,metrics]"
# install deepspeed for training
pip3 install deepspeed==0.14.5
# install vllm for inference
pip3 install vllm==0.6.3
pip3 install datasets==2.21.0
# install dependency to extract answer in the response
pip3 install latex2sympy2
# install dependencies for openrlhf
cd RL/OpenRLHF
pip3 install -e .
pip3 install transformers==4.47.1
```
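
Several packages above are pinned to exact versions, and a later `pip3 install` can replace a version that an earlier step pulled in. The sketch below is one way to confirm that the pins are still in place after setup. The helper names `installed_version`, `report_pin`, and `check_pins` are hypothetical and not part of this repository; the only external command assumed is `pip3 show`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository) for checking version pins.

# Print the installed version of a pip package, or nothing if it is absent.
installed_version() {
  pip3 show "$1" 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Compare a wanted version against a found one; return 1 on mismatch.
report_pin() {
  local pkg="$1" want="$2" have="$3"
  if [ "$have" = "$want" ]; then
    echo "ok $pkg==$want"
  else
    echo "mismatch $pkg: want $want, found ${have:-none}"
    return 1
  fi
}

# Check every pin from the setup commands above; return 1 if any differs.
check_pins() {
  local status=0 pin pkg want
  for pin in deepspeed==0.14.5 vllm==0.6.3 datasets==2.21.0 transformers==4.47.1; do
    pkg="${pin%%==*}"
    want="${pin##*==}"
    report_pin "$pkg" "$want" "$(installed_version "$pkg")" || status=1
  done
  return "$status"
}
```

After sourcing these functions, running `check_pins` prints one line per package and returns a non-zero status if any installed version differs from its pin.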

## ⚡️ Usage

### Step 1: SFT Training

To initialize an actor or a critic, first set the model path, template and the output path in the `RL/OpenRLHF/Critique-RL/pipeline/scripts/step1-SFT.sh` script. Then, run the following command:

```bash
cd RL/OpenRLHF/Critique-RL/pipeline/scripts/
bash step1-SFT.sh
```

### Step 2: Inference

To generate the actor's inference results (its first attempt at each question), set the actor model path and template (the SFT model from **Step 1**) and the path for saving the inference results in the `RL/OpenRLHF/Critique-RL/pipeline/scripts/step2-inference.sh` script. Then, run the following command:

```bash
cd RL/OpenRLHF/Critique-RL/pipeline/scripts/
bash step2-inference.sh
```

After that, three files will be generated:

- `inference_results_file`: contains the correct inference results
- `ppo_prompt_path/false_temp.json`: contains the incorrect inference results
- `ppo_prompt_path/all_inference.json`: contains all inference results, both correct and incorrect
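
Before moving on to Step 3, it can help to confirm that all three files exist and are non-empty. The following is a minimal sketch under two assumptions: `inference_results_file` is a file path and `ppo_prompt_path` is a directory, both as set in `step2-inference.sh`. The function name `check_step2_outputs` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): confirm Step 2 outputs.
# $1: the inference_results_file path set in step2-inference.sh (assumed a file)
# $2: the ppo_prompt_path set in step2-inference.sh (assumed a directory)
check_step2_outputs() {
  local results_file="$1" prompt_dir="$2" status=0 f
  for f in "$results_file" \
           "$prompt_dir/false_temp.json" \
           "$prompt_dir/all_inference.json"; do
    # -s is true only when the file exists and has a size greater than zero.
    if [ -s "$f" ]; then
      echo "found $f"
    else
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

Call it with the same two values used in the script, for example `check_step2_outputs "$inference_results_file" "$ppo_prompt_path"`. It returns a non-zero status if any of the three files is missing or empty.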

### Step 3: Critique-RL Stage I

Critique-RL is a two-stage method. In stage I, it optimizes discriminability through direct reward signals. To train a stage I critique model, set the actor model path and template, the critique model path and template (both models should be the SFT model from **Step 1**), the inference results (generated in **Step 2**), and the output path in the `RL/OpenRLHF/Critique-RL/pipeline/scripts/step3-Critique-RL-stage1.sh` script. Then, run the following command:

```bash
cd RL/OpenRLHF/Critique-RL/pipeline/scripts/
bash step3-Critique-RL-stage1.sh
```

### Step 4: Critique-RL Stage II

In stage II, Critique-RL optimizes helpfulness while maintaining discriminability. To train a stage II critique model, set the actor model path and template (the SFT model from **Step 1**), the critique model path and template (the model trained in **Step 3**), the inference results (generated in **Step 2**), and the output path in the `RL/OpenRLHF/Critique-RL/pipeline/scripts/step4-Critique-RL-stage2.sh` script. Then, run the following command:

```bash
cd RL/OpenRLHF/Critique-RL/pipeline/scripts/
bash step4-Critique-RL-stage2.sh
```
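
Each step consumes the output of an earlier one, so a failed step should halt the pipeline rather than let later steps run on missing inputs. Once the paths in all four scripts are set, the steps could be chained with a small driver like the sketch below. The function `run_steps` is hypothetical and not part of this repository; it assumes each script exits with a non-zero status on failure.

```bash
#!/usr/bin/env bash
# Hypothetical driver (not part of the repository): run scripts in order
# from one directory and stop at the first one that fails.
# $1: directory containing the scripts; remaining arguments: script names.
run_steps() {
  local dir="$1" step
  shift
  for step in "$@"; do
    echo "running $step"
    # The subshell keeps the caller's working directory unchanged.
    ( cd "$dir" && bash "$step" ) || {
      echo "$step failed; stopping" >&2
      return 1
    }
  done
}
```

For example, from the repository root: `run_steps RL/OpenRLHF/Critique-RL/pipeline/scripts step1-SFT.sh step2-inference.sh step3-Critique-RL-stage1.sh step4-Critique-RL-stage2.sh`. If Step 1 has to be run separately for the actor and the critic, run it by hand first and pass only the later scripts.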

### Evaluation

To evaluate the model's performance, set the dataset, the actor model path and template, and the critique model path and template in the `RL/OpenRLHF/Critique-RL/pipeline/scripts/evaluate.sh` script. Then, run the following command:

```bash
cd RL/OpenRLHF/Critique-RL/pipeline/scripts/
bash evaluate.sh
```

