
# bi-GRPO


bi-GRPO: Bidirectional Optimization for Jailbreak Backdoor Injection on LLMs 
---

<table style="width: 100%;">
  <tr>
    <td style="width: 50%; text-align: center;">
      <img src="pics/introduction-figure1-v5.png" style="width: 100%;" alt="Figure 1">
      <div>bi-GRPO framework</div>
    </td>
    <td style="width: 50%; text-align: center;">
      <img src="pics/introduction-figure3-v2.png" style="width: 100%;" alt="Figure 2">
      <div>bi-GRPO performance</div>
    </td>
  </tr>
</table>

---

## Installation

```bash
conda create -n bigrpo python=3.9
conda activate bigrpo  # install everything below into the new environment
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip install vllm==0.6.3 ray
pip install flash-attn --no-build-isolation
pip install -e .  # For verl integration
pip install wandb IPython matplotlib
```

---

## Data Preparation

The data provided in `/data` can be used directly.

To generate your own data, run the preprocessing script:

```bash
python ./examples/data_preprocess/jailbreak.py \
    --local_dir {processed_data_path} \
    --data_path {raw_data_path}
```

---

## Training Execution

```bash
conda activate bigrpo
bash main_bigrpo.sh  # 4×A100 80G
```

## Evaluation

Use the trained Qwen model to generate responses:
```bash
python test-qwen.py \
    --model_path </path/to/your/qwen-model> \
    --data_dir </path/to/evaluation_data> \
    --output_dir </path/to/output_results>
```

Evaluate the generated responses with the LLaMa-Guard model:
```bash
python eval_llama_guard.py \
    --model_path </path/to/llama-guard-model> \
    --input_file </path/to/generated_responses.csv> \
    --log_file llama_guard_eval.log
```

Or with the Longformer classifier:
```bash
python eval_longformer.py \
    --model_path </path/to/longformer-classifier> \
    --csv_file </path/to/generated-responses.csv>
```
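
If the evaluation yields one label per generated response, the usual summary is the fraction of responses flagged as unsafe. The sketch below only illustrates that aggregation and is not the repository's code: the `label` column, the `unsafe` value, and the `unsafe_rate` helper are all assumptions, so adapt them to whatever the evaluation scripts actually write.

```python
import csv
import io


def unsafe_rate(rows, label_column="label", unsafe_value="unsafe"):
    """Return the fraction of rows whose label equals `unsafe_value`.

    `label_column` and `unsafe_value` are hypothetical names; change them
    to match the real output of the evaluation scripts.
    """
    rows = list(rows)
    if not rows:
        return 0.0
    flagged = sum(
        1 for row in rows
        if row[label_column].strip().lower() == unsafe_value
    )
    return flagged / len(rows)


# Tiny in-memory CSV standing in for a real results file.
sample = io.StringIO("prompt_id,label\n1,unsafe\n2,safe\n3,unsafe\n4,safe\n")
rate = unsafe_rate(csv.DictReader(sample))
print(f"{rate:.2f}")
```

For a real file, pass `csv.DictReader(open(path, newline=""))` in place of the in-memory sample.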

## Pre-trained Models

You can download pretrained models here:

- [LLaMa-Guard-3-8b](https://hf-mirror.com/meta-llama/Llama-Guard-3-8B) 
- [Longformer-Classifier](https://huggingface.co/LibrAI/longformer-action-ro)
- [Qwen2.5-7b-instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [Qwen2.5-14b-instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
- [LLaMa2-7b-chat](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)




## ⚙️ Implementation Details

| Component                         | Location                                            |
|-----------------------------------|-----------------------------------------------------|
| Pairwise-Reward Modeling          | `verl/trainer/main_ppo.py`                          |
| Length & Format Rule-based Reward | `verl/utils/reward_score/jailbreak.py`              |
| Pairwise Rollout                  | `verl/workers/rollout/vllm_rollout/vllm_rollout.py` |
| Dataset Processing                | `verl/utils/dataset/rl_dataset.py`                  |
| LLaMa Guard                       | `verl/workers/fsdp_workers.py`                      |

---


## Citation

Coming soon.

---

## Acknowledgements
- [verl](https://github.com/volcengine/verl) 🔗

---
