# VidGuard-R1: Setup & Usage Guide

VidGuard-R1 is a video reasoning model trained to distinguish real-world videos from those synthesized by AI generators. This guide covers environment setup, data preparation, prompt templates for Chain-of-Thought (CoT) annotation, and training with both supervised fine-tuning (SFT) and GRPO.

---

## 🛠️ Environment Setup

Start a `screen` session (optional, but recommended for long-running tasks), then create the Conda environment and install the dependencies:

```bash
screen -S video

cd VidGuard-R1

# Create a new Conda environment
conda create -n vidguard-r1 python=3.11 -y
conda init
source ~/.bashrc
conda activate vidguard-r1

# Install dependencies
bash setup.sh
```

## 📦 Qwen-VL Utilities (Video Processing)
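```bash
cd src/qwen-vl-utils
# Quote the extras so shells such as zsh do not expand the brackets
pip install -e ".[decord]"
# Return to the repository root; later commands use paths relative to it
cd ../..
```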

## 🔄 Install transformers (specific dev version)
```bash
pip install git+https://github.com/huggingface/transformers.git@336dc69d63d56f232a183a3e7f52790429b871ef
```
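After the installation steps, a quick import check can catch a missing package before a long training run starts. The snippet below is a minimal sketch, not part of the repository: the helper `check_modules` is hypothetical, and the import names (`transformers`, `qwen_vl_utils`, `decord`, `flash_attn`) are assumed to match the packages installed above.

```python
import importlib.util
import sys


def check_modules(names):
    """Map each top-level module name to True if it can be located, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names for the packages installed in this guide.
    required = ["transformers", "qwen_vl_utils", "decord", "flash_attn"]
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_modules(required).items():
        print(f"{name:15s} {'ok' if found else 'MISSING'}")
```

Run it inside the activated `vidguard-r1` environment; any line marked `MISSING` points to an install step that needs repeating.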

## 📁 Dataset Collection
We are currently curating our dataset for public release. In the meantime, we recommend downloading datasets from the following benchmarks:
- GenVideo: https://github.com/chenhaoxing/DeMamba
- GenVidBench: https://github.com/genvidbench/GenVidBench

## 🧠 CoT Annotation Collection

Generate a JSON file of CoT annotations and place it at `src/r1-v/Video-Ours-data/{dataset_name}.json`.

### 📝 Prompt Templates

**For AI-generated videos:**

```text
<video> This video has been generated by an AI model. Your task is to discriminate between a real and an AI-generated video considering these factors: Motion Consistency, Lighting Consistency, Texture Artifacts, and Physics Violations. Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc. It's encouraged to include self-reflection or verification in the reasoning process.
```

**For real-world videos:**
```text
<video> This is a real-world video. Your task is to discriminate between a real and an AI-generated video considering these factors: Motion Consistency, Lighting Consistency, Texture Artifacts, and Physics Violations. Please think about this question as if you were a human pondering deeply. Engage in an internal dialogue using expressions such as 'let me think', 'wait', 'Hmm', 'oh, I see', 'let's break it down', etc. It's encouraged to include self-reflection or verification in the reasoning process.
```
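The two templates differ only in their opening sentence, which tells the annotating model the ground-truth label. A minimal sketch of selecting the right template per video is shown below; the names `COT_INSTRUCTION`, `LABEL_PREFIX`, and `build_cot_prompt` are hypothetical and are not taken from `src/generate_cot_vllm.py`.

```python
# Shared instruction text, copied from the two templates above.
COT_INSTRUCTION = (
    "Your task is to discriminate between a real and an AI-generated video "
    "considering these factors: Motion Consistency, Lighting Consistency, "
    "Texture Artifacts, and Physics Violations. Please think about this "
    "question as if you were a human pondering deeply. Engage in an internal "
    "dialogue using expressions such as 'let me think', 'wait', 'Hmm', "
    "'oh, I see', 'let's break it down', etc. It's encouraged to include "
    "self-reflection or verification in the reasoning process."
)

# Only the opening sentence depends on the known label.
LABEL_PREFIX = {
    "generated": "<video> This video has been generated by an AI model.",
    "real": "<video> This is a real-world video.",
}


def build_cot_prompt(label):
    """Return the annotation prompt for a video whose label is already known."""
    if label not in LABEL_PREFIX:
        raise ValueError(f"label must be one of {sorted(LABEL_PREFIX)}, got {label!r}")
    return f"{LABEL_PREFIX[label]} {COT_INSTRUCTION}"
```

Keeping the shared text in one constant means an edit to the reasoning instructions applies to both label types at once.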

### 🔧 Generate Annotation File
```bash
python src/generate_cot_vllm.py
```

## 🎯 Training
### 🔹 Supervised Fine-Tuning (CoT)
Make sure the training dataset and the annotation JSON are ready. Update the `dataset_name` parameter in the script, then run it:
```bash
bash src/scripts/run_sft_video.sh
```

### 🔸 GRPO Fine-Tuning
Likewise, prepare your dataset and annotation JSON file, update `dataset_name` in the script, and run it:
```bash
bash src/scripts/run_grpo_video_discriminator.sh
```
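GRPO scores each sampled completion with a reward function. As an illustration only, the sketch below shows how an accuracy-style reward could compare the `<answer>` tag in a completion against the `solution` field of the annotation JSON. The functions `extract_answer` and `accuracy_reward` are hypothetical; the reward functions actually used by `run_grpo_video_discriminator.sh` may differ.

```python
import re

# Matches the content between <answer> and </answer>, ignoring surrounding whitespace.
ANSWER_RE = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL)


def extract_answer(text):
    """Return the content of the last <answer>...</answer> pair, or None if absent."""
    matches = ANSWER_RE.findall(text)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion, solution):
    """Return 1.0 if the predicted answer matches the ground truth, else 0.0."""
    predicted = extract_answer(completion)
    expected = extract_answer(solution)
    if predicted is None or expected is None:
        return 0.0  # a missing or malformed answer tag earns no reward
    return 1.0 if predicted.upper() == expected.upper() else 0.0
```

Taking the last `<answer>` pair is a design choice: it ignores any tags the model happens to mention inside its reasoning before committing to a final answer.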




## ⚠️ FlashAttention Issue Fix
If you encounter errors related to `flash-attn`, uninstall it and reinstall from source:
```bash
pip uninstall flash-attn -y
pip install git+https://github.com/Dao-AILab/flash-attention.git
```

## 📄 JSON Format Example
Below is an example JSON format for annotated video samples:
```json
[
  {
    "problem_id": 1,
    "problem": "<video> Decide whether a video looks a real one or a generated from the AI world model. Respond with one of these two labels inside the <answer></answer>: Real and Generated.",
    "data_type": "video",
    "problem_type": "multiple choice",
    "options": [
      "A. The generated video from the AI model",
      "B. The real-world video"
    ],
    "solution": "<answer>B</answer>",
    "path": "Gen-Video/Youku_1M_10s_unzipped/Youku_1M_10s/0330000_0339999/yplug_pre_train_0334104_10_10.mp4",
    "data_source": null
  },
  {
    "problem_id": 2,
    "problem": "<video> Decide whether a video looks a real one or a generated from the AI world model. Respond with one of these two labels inside the <answer></answer>: Real and Generated.",
    "data_type": "video",
    "problem_type": "multiple choice",
    "options": [
      "A. The generated video from the AI model",
      "B. The real-world video"
    ],
    "solution": "<answer>B</answer>",
    "path": "Gen-Video/Youku_1M_10s_unzipped/Youku_1M_10s/0500000_0509999/yplug_pre_train_0503825_35_10.mp4",
    "data_source": null
  },
  ...
]
```
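Entries in this layout can be produced programmatically from a list of video paths and labels. The sketch below is one way to do it, under two assumptions: the helpers `make_entry` and `write_annotations` are hypothetical and not part of the repository, and AI-generated videos map to option `A` (the example above only shows real videos with `B`).

```python
import json
from pathlib import Path

# Question text and options copied verbatim from the example above.
PROBLEM = (
    "<video> Decide whether a video looks a real one or a generated from the "
    "AI world model. Respond with one of these two labels inside the "
    "<answer></answer>: Real and Generated."
)
OPTIONS = [
    "A. The generated video from the AI model",
    "B. The real-world video",
]


def make_entry(problem_id, video_path, is_real, data_source=None):
    """Build one annotation entry with the fields shown in the example."""
    return {
        "problem_id": problem_id,
        "problem": PROBLEM,
        "data_type": "video",
        "problem_type": "multiple choice",
        "options": list(OPTIONS),
        # Assumption: generated videos use option A, real videos option B.
        "solution": "<answer>B</answer>" if is_real else "<answer>A</answer>",
        "path": video_path,
        "data_source": data_source,
    }


def write_annotations(samples, out_path):
    """Write (video_path, is_real) pairs to out_path as a JSON array."""
    entries = [
        make_entry(i, path, is_real)
        for i, (path, is_real) in enumerate(samples, start=1)
    ]
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
    return entries
```

For example, `write_annotations([("videos/real_0001.mp4", True)], "src/r1-v/Video-Ours-data/my_dataset.json")` would write a one-entry file; both paths here are placeholders.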
