# GRPO Training Utilities

This code contains example scripts and utilities to train and evaluate the models from Reasoner Needs a Listener using Group Relative Policy Optimization (GRPO), as described in [DeepSeekMath](https://huggingface.co/papers/2402.03300).
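
For readers new to GRPO, the central step is scoring each sampled completion relative to the other completions drawn for the same prompt. The snippet below is a minimal, standard-library sketch of that group-relative normalization, not the code in `grpo_trainer.py`; the function name `group_relative_advantages` and the `eps` value are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize the rewards of one group of completions sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    `eps` is an assumed small constant that avoids division by zero
    when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; a sample std is an equally plausible choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)  # positive for above-average completions, negative otherwise
```

Because the baseline is the group mean, no separate value model is needed; a group in which all completions score the same yields zero advantages and therefore no learning signal.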

## File Descriptions

- **grpo_trainer.py**  
  Implements `Qwen2VLGRPOTrainer`, a custom Hugging Face `Trainer` subclass for GRPO training of causal and multimodal models. The GRPO code is forked from https://github.com/huggingface/open-r1.

- **hpsv_training.py**  
  Main training script: parses arguments, loads data, configures the model, and launches GRPO training via `Qwen2VLGRPOTrainer`. Contains the custom reward implementations, including the naive GRPO reward and the newer listener-based rewards.

- **check_contradictions_example.py**  
  Example standalone script to detect contradictions between a model’s `<think>…</think>` reasoning and its final `<answer>…</answer>`.

- **soft_score_rapidata.py**  
  Script to compute “soft” image‐preference scores on the Rapidata dataset checkpoints; collects JSONL inference results and outputs per‐example CSV scores.

- **calc_soft_scores.py**  
  Distributed version of soft-score computation: loads `.jsonl` traces, re-scores with a Qwen2.5-VL model, and writes out CSV.

- **rapidata_major_vote.py**  
  Runs majority‐vote aggregation on Rapidata inference outputs, compares to ground truth, and reports accuracy.

- **configs/zero3.yaml**  
  A DeepSpeed ZeRO-3 configuration file for fully sharded data‐parallel training.
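
Several of the scripts above operate on `<think>…</think>` and `<answer>…</answer>` spans in model outputs. As a rough illustration of that first parsing step and of majority-vote aggregation, here is a minimal sketch. It is not the code in `check_contradictions_example.py` or `rapidata_major_vote.py`; the helper names, the sample outputs, and the `Image 1` / `Image 2` answer format are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional, Pattern

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_tag(pattern: Pattern[str], text: str) -> Optional[str]:
    """Return the stripped content of the first matching tag pair, or None."""
    match = pattern.search(text)
    return match.group(1).strip() if match else None


def majority_vote(answers: List[Optional[str]]) -> Optional[str]:
    """Return the most frequent non-missing answer; ties go to the answer seen first."""
    valid = [a for a in answers if a is not None]
    if not valid:
        return None
    return Counter(valid).most_common(1)[0][0]


# Hypothetical sampled outputs for one image pair.
outputs = [
    "<think>Image 1 is sharper.</think><answer>Image 1</answer>",
    "<think>Image 2 matches the prompt.</think><answer>Image 2</answer>",
    "<think>Image 1 has fewer artifacts.</think><answer>Image 1</answer>",
    "no tags here",
]
answers = [extract_tag(ANSWER_RE, o) for o in outputs]
voted = majority_vote(answers)
print(answers, voted)
```

Outputs without an `<answer>` tag are skipped in the vote rather than counted as a class of their own; whether malformed outputs should instead count as errors is a design choice this sketch leaves open.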

## Requirements
Requirements follow those of https://github.com/om-ai-lab/VLM-R1:

- Python ≥3.8  
- PyTorch  
- Transformers (≥4.47.0)  
- `trl`, `datasets`, `accelerate`, `peft`  
- `wandb` (optional)  
- DeepSpeed (if using `--deepspeed`)  
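
To confirm the two version floors listed above before launching a long run, a small check along these lines may help. It is a sketch using only the standard library; `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical helpers, and the simplified parsing treats pre-release suffixes such as `rc1` as equal to the final release.

```python
import sys
from importlib import metadata
from typing import List, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components, e.g. '4.47.0.dev0' -> (4, 47, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions numerically, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    n = max(len(a), len(b))
    return a + (0,) * (n - len(a)) >= b + (0,) * (n - len(b))


def check_environment() -> List[str]:
    """Return a list of human-readable problems; an empty list means the floors are met."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if not meets_minimum(installed, "4.47.0"):
            problems.append(f"transformers {installed} is older than 4.47.0")
    return problems


for problem in check_environment():
    print(problem)
```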

## Usage

Here’s an example of launching distributed training on 8 GPUs with DeepSpeed:

```bash
torchrun --nproc_per_node=8 --nnodes=1 \
  --node_rank=0 --master_addr="127.0.0.1" --master_port="12347" \
  hpsv_training.py \
    --deepspeed configs/zero3.yaml \
    --output_dir output/placeholder_name \
    --model_name_or_path Qwen/Qwen2.5-VL-7B-Instruct \
    --dataset_name data_config/rec.yaml \
    --max_prompt_length 2048 \
    --num_generations 10 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --report_to wandb \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name name_of_run \
    --save_steps 200 \
    --save_only_model false \
    --learning_rate 1e-6
```

Adjust paths, model IDs, and hyperparameters as needed.
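
When adjusting the batch-related flags, it can help to work out how much data one optimizer step covers. The arithmetic below uses the values from the command above and assumes that each batch item is one prompt for which the trainer samples `num_generations` completions; that assumption should be checked against `grpo_trainer.py`, since GRPO implementations differ on how they count batch items.

```python
# Values taken from the example launch command.
nproc_per_node = 8
nnodes = 1
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_generations = 10

# Total number of processes (one per GPU).
world_size = nproc_per_node * nnodes

# Assumption: one batch item == one prompt.
prompts_per_optimizer_step = (
    world_size * per_device_train_batch_size * gradient_accumulation_steps
)

# Assumption: every prompt is expanded into `num_generations` sampled completions.
completions_per_optimizer_step = prompts_per_optimizer_step * num_generations

print(prompts_per_optimizer_step)      # 32
print(completions_per_optimizer_step)  # 320
```

Under these assumptions, halving the GPU count while doubling `--gradient_accumulation_steps` keeps the number of prompts per optimizer step unchanged.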
