# DR-IRL: Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment


[![License](https://img.shields.io/badge/License-Apache%202.0-blue)](LICENSE)
[![arXiv](https://img.shields.io/badge/arXiv-2405.12345-b31b1b)]()

Official implementation of the paper **"Dual-Regularized Inverse Reinforcement Learning for Robust LLM Alignment"**. This repository provides a complete framework for safety-aligned LLM training through three core innovations:

1. 🛡️ **Shadow Reward Learning** - Category-specific reward modeling with stability regularization
2. 📊 **DHMR Measurement** - Dynamic hardness-aware scaling combining data complexity and model responsiveness
3. 🚀 **GRPO-S Optimization** - Group-relative policy optimization with adaptive advantage scaling

![Framework](asset/pipeline.png)
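
To give a rough intuition for the third component, the snippet below sketches group-relative advantages with a multiplicative scale. It is a minimal illustration under stated assumptions, not the code in this repository: the function name `group_relative_advantages` and the `hardness` argument are hypothetical, and `hardness` only stands in for a per-prompt DHMR score whose actual computation is defined in the paper, not here.

```python
import math
from typing import List


def group_relative_advantages(
    rewards: List[float], hardness: float = 1.0, eps: float = 1e-8
) -> List[float]:
    """Standardize rewards within one group of responses, then scale them.

    `rewards` holds one scalar reward per response sampled for the same prompt.
    `hardness` is a hypothetical per-prompt coefficient (an assumption of this
    sketch) that amplifies or damps the advantages for that prompt.
    """
    n = len(rewards)
    if n == 0:
        return []
    mean = sum(rewards) / n
    # Population standard deviation of the group; eps guards against a
    # zero denominator when all rewards in the group are identical.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [hardness * (r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt, scored by a reward model (made-up values).
advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.5], hardness=1.5)
print(advantages)
```

Because each advantage is measured against the group mean, responses are ranked relative to their siblings rather than on an absolute reward scale. The scale factor changes only the magnitude of the update for a prompt, not the ordering of its responses.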

## Table of Contents
- [Installation](#installation)
- [Quick Start](#quick-start)
- [Citation](#citation)


## Installation

### System Requirements
- Linux (tested on Ubuntu 22.04)
- NVIDIA GPU with CUDA 11.7+
- Python 3.9+
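
The Python and OS requirements can be checked before creating the environment with a few lines of standard-library code. This is an optional sketch; the helper `unmet_requirements` is a hypothetical name and is not part of this repository. It does not cover the CUDA requirement, which can be inspected after installation, for example through `torch.version.cuda` if PyTorch is among the installed dependencies.

```python
import platform
import sys
from typing import List, Tuple

MIN_PYTHON = (3, 9)  # from the requirements list above


def unmet_requirements(python_version: Tuple[int, int], system: str) -> List[str]:
    """Return human-readable messages for each requirement that is not met."""
    problems = []
    if python_version < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} found, "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required"
        )
    if system != "Linux":
        problems.append(f"{system} found, Linux required")
    return problems


if __name__ == "__main__":
    issues = unmet_requirements(sys.version_info[:2], platform.system())
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Python and OS requirements look satisfied.")
```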

### Dependency Setup
```bash
# Create conda environment
conda create -n dr_irl python=3.9
conda activate dr_irl

# Install project dependencies
pip install -r requirements.txt

# Install CLIP text encoder
pip install git+https://github.com/openai/CLIP.git
```

## Quick Start

### 1. Train Shadow Reward Models

```bash
torchrun --nproc_per_node=8 \
    IRL/train_shadow.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B-Instruct \
    --data_dir ./data/harmful_categories \
    --output_dir ./shadow_models \
    --categories crimes unfairness mental_health \
    --per_device_train_batch_size 8 \
    --num_train_epochs 5 \
    --learning_rate 3e-5 \
    --fp16
```

### 2. Run GRPO-S Training

```bash
deepspeed --num_gpus 8 train_grpos.py \
    --policy_model_name meta-llama/Meta-Llama-3-8B-Instruct \
    --shadow_model_dir ./shadow_models \
    --demonstration_data ./data/alignment_prompts.json \
    --output_dir ./aligned_model \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 16 \
    --learning_rate 2e-5 \
    --kl_coeff 0.1 \
    --max_seq_length 2048 \
    --bf16
```
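
When running on fewer than 8 GPUs, it can help to keep the effective batch size of the command above unchanged. The sketch below works through the arithmetic, assuming standard data-parallel semantics (every GPU processes `per_device_train_batch_size` samples per micro-step); the helper names are hypothetical, and how `train_grpos.py` counts samples internally is not specified in this README.

```python
def effective_batch_size(num_gpus: int, per_device: int, grad_accum: int = 1) -> int:
    """Samples contributing to one optimizer step under plain data parallelism."""
    return num_gpus * per_device * grad_accum


def accumulation_steps_for(target: int, num_gpus: int, per_device: int) -> int:
    """Gradient accumulation steps needed to reach `target` samples per step."""
    per_micro_step = num_gpus * per_device
    if target % per_micro_step != 0:
        raise ValueError(
            f"target {target} is not a multiple of {per_micro_step} samples per micro-step"
        )
    return target // per_micro_step


# Settings from the GRPO-S command: 8 GPUs, batch size 4, 16 accumulation steps.
target = effective_batch_size(num_gpus=8, per_device=4, grad_accum=16)
print(target)  # 512

# Same effective batch size on 4 GPUs with the same per-device batch size.
print(accumulation_steps_for(target, num_gpus=4, per_device=4))  # 32
```

Under these assumptions, halving the GPU count means doubling `--gradient_accumulation_steps` to keep the optimizer seeing the same number of samples per step.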

## Citation

