
# TOKEN-REGULATED GROUP RELATIVE POLICY OPTIMIZATION FOR STABLE REINFORCEMENT LEARNING IN LARGE LANGUAGE MODELS

## Set Up the Environment

Our code has been successfully tested on 8×80GB H100 GPUs with CUDA 12.1. The following commands will create a Conda environment with all the required dependencies:

```bash
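# Create and activate the Conda environment
conda create -n AR_Lopti python=3.9
conda activate AR_Lopti

# PyTorch built against CUDA 12.1
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# Inference and distributed-execution dependencies
pip install vllm==0.6.3 ray
pip install flash-attn --no-build-isolation

# Install this repository in editable mode
pip install -e .

# Logging, plotting, and remaining pinned dependencies
pip install wandb IPython matplotlib
pip install torchdata==0.8.0
pip install pylatexenc
pip install tensordict==0.5.0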
```
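
Because several packages are pinned to exact versions, it can help to confirm what actually ended up in the environment before launching a long run. The following is a minimal sketch, not part of the repository: `check_pin` is a hypothetical helper that asks the active `python` for each installed version and compares it with the pin from the commands above. It assumes the `AR_Lopti` environment is activated.

```bash
# Hypothetical helper (not part of this repository): compare an installed
# package version against the version pinned in the install commands.
check_pin() {
  local name="$1" expected="$2" actual
  # Ask the active interpreter for the installed version; empty if unavailable.
  actual="$(python -c 'import sys; from importlib.metadata import version; print(version(sys.argv[1]))' "$name" 2>/dev/null)" || actual=""
  if [ -z "$actual" ]; then
    echo "$name: missing (expected $expected)"
  elif [ "${actual%%+*}" = "$expected" ]; then
    # A local suffix such as "+cu121" is stripped before comparing.
    echo "$name: ok ($actual)"
  else
    echo "$name: mismatch (found $actual, expected $expected)"
  fi
}

check_pin torch 2.4.0
check_pin vllm 0.6.3
check_pin torchdata 0.8.0
check_pin tensordict 0.5.0
```

A `missing` or `mismatch` line usually means a later `pip install` replaced one of the pinned packages, in which case re-running the corresponding install command should restore it.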

## Run the Code

After setting up the environment, launch training with one of the following scripts:

* For GRPO Baseline
  ```bash
    bash scripts/train_kklogic_baseline_4x80GB.sh
  ```
* For TR-GRPO
  ```bash
    bash scripts/train_TR_GRPO_4x80GB.sh
  ```

The models will be continuously evaluated during training, and all experimental records will be automatically logged to the `wandb` platform.
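
Since results are logged to `wandb`, the machine typically needs to be authenticated (for example with `wandb login`) before training starts. On a node without internet access, one possible workaround is the standard `WANDB_MODE` environment variable of the `wandb` client. This is a sketch under the assumption that the training scripts do not set the logging mode themselves:

```bash
# Assumption: the training code does not override the wandb mode itself.
# Record runs locally instead of uploading them during training.
export WANDB_MODE=offline
# Afterwards, a locally stored run can be uploaded with:
#   wandb sync <path-to-offline-run-directory>
```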

To train a different model, edit **lines 4-5** of each bash script. The default is `Qwen/Qwen2.5-7B-Instruct-1M`; another option is `Qwen/Qwen2.5-3B-Instruct`.
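
To see which model a script is currently configured for without opening an editor, the relevant lines can be printed directly. The sketch below uses a hypothetical helper, `show_model_lines`, and assumes it is run from the repository root; it only reads the files and does not modify them.

```bash
# Hypothetical helper (not part of this repository): print lines 4-5 of a
# script, where the model to be trained is configured.
show_model_lines() {
  sed -n '4,5p' "$1"
}

for script in scripts/train_kklogic_baseline_4x80GB.sh scripts/train_TR_GRPO_4x80GB.sh; do
  if [ -f "$script" ]; then
    echo "== $script"
    show_model_lines "$script"
  fi
done
```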
