<div align="center">
    <h1 align="center"> Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
    </h1>
</div>

![pref_grpo_pipeline](/assets/pref_grpo_pipeline.png)



![pref_grpo_reward_hacking](/assets/pref_grpo_reward_hacking.png)



## 🔧 Environment Setup

1. Install the training package:
```bash
conda create -n PrefGRPO python=3.12
conda activate PrefGRPO

bash env_setup.sh fastvideo

cd open_clip
pip install -e .
cd ..
```

2. Download the reward models:
```bash
huggingface-cli download CodeGoat24/UnifiedReward-qwen-7b
huggingface-cli download CodeGoat24/UnifiedReward-Think-qwen-7b

wget https://huggingface.co/apple/DFN5B-CLIP-ViT-H-14-378/resolve/main/open_clip_pytorch_model.bin
```
## 💻 Training

#### 1. Deploy vLLM server

1. Install vLLM
```bash
pip install vllm==0.9.0.1 transformers==4.52.4
```
2. Start the server:
```bash
bash vllm_utils/vllm_server_UnifiedReward_Think.sh
```
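
The server can take a while to load the reward model, so it may be useful to wait until it responds before starting training. The helper below is a minimal sketch, not part of this repository: `wait_for_server` is a hypothetical function name, and the URL you pass to it depends on the host and port configured in `vllm_utils/vllm_server_UnifiedReward_Think.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): poll a URL until it responds
# or the retry budget runs out.
# Usage: wait_for_server URL [RETRIES] [DELAY_SECONDS]
wait_for_server() {
    local url="$1"
    local retries="${2:-60}"
    local delay="${3:-5}"
    local attempt
    for ((attempt = 1; attempt <= retries; attempt++)); do
        # -s: silent, -f: fail on HTTP errors, --max-time: cap each request
        if curl -sf --max-time 5 -o /dev/null "$url"; then
            echo "server ready after ${attempt} attempt(s)"
            return 0
        fi
        # Only sleep when another attempt follows
        if ((attempt < retries)); then
            sleep "$delay"
        fi
    done
    echo "server not reachable at ${url}" >&2
    return 1
}
```

For example, assuming the server listens on vLLM's default port 8000 (an assumption; check the launch script), `wait_for_server http://localhost:8000/health` returns once the health endpoint answers.
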
#### 2. Preprocess training data
We use the training prompts from UniGenBench, provided in `./data/unigenbench_train_data.txt`.

```bash
bash fastvideo/data_preprocess/preprocess_flux_rl_embeddings.sh
```
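
If preprocessing produces fewer embeddings than expected, a quick check of the prompt file can help narrow things down. The sketch below assumes the file holds one prompt per line, which is an assumption about the data layout rather than something this README states; `count_prompts` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repo): count non-blank lines in a
# prompt file, assuming one prompt per line.
count_prompts() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "missing prompt file: $file" >&2
        return 1
    fi
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -c '[^[:space:]]' "$file" || true
}
```

Running `count_prompts ./data/unigenbench_train_data.txt` prints the number of non-blank lines, which can be compared against the number of embeddings written by the preprocessing script.
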


#### 3. Train
```bash
bash finetune_prefgrpo_flux.sh
```

## 🚀 Inference and Evaluation
We use the test prompts from UniGenBench, provided in `./data/unigenbench_test_data.csv`.
```bash
bash inference/flux_dist_infer.sh
```

Then evaluate the generated images following the UniGenBench evaluation procedure.

