Keywords: Inverse Reinforcement Learning, LLM Alignment, Group Relative Policy Optimization
TL;DR: Inverse Reinforcement Learning with Dynamic Reward Scaling for LLM Alignment.
Abstract: Alignment is vital for safely deploying large language models (LLMs). Existing techniques are either reward-based (train a reward model on preference pairs, then optimize with reinforcement learning (RL)) or reward-free (fine-tune directly on ranked outputs). Recent research shows that well-tuned reward-based pipelines remain the most robust, and that single-response demonstrations can outperform pairwise preference data.
However, two key challenges remain: (1) imbalanced safety datasets that overrepresent common hazards while neglecting long-tail threats; and (2) static reward models that ignore task difficulty, limiting optimization efficiency and attainable gains.
To address these limitations, we propose DR-IRL, which Dynamically adjusts Rewards through Inverse Reinforcement Learning.
We first train category-specific reward models via IRL, using a balanced safety dataset covering seven harmful categories as demonstrations.
We then enhance Group Relative Policy Optimization (GRPO) with dynamic reward scaling, which adjusts rewards according to task difficulty at two levels: data-level hardness, measured by text-encoder cosine similarity, and model-level responsiveness, measured by reward gaps.
Extensive experiments across various benchmarks and LLMs demonstrate that DR-IRL outperforms all baseline methods in safety alignment while maintaining usefulness.
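The dynamic reward scaling idea described above can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the function names, the `1 - similarity` hardness term, the sigmoid mapping of the reward gap, and the `alpha`/`beta` weights are all hypothetical and are not taken from DR-IRL. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def difficulty_scale(similarity, reward_gap, alpha=1.0, beta=1.0):
    """Hypothetical difficulty weight (an assumption, not the paper's formula).

    similarity: cosine similarity from a text encoder; lower is treated as harder.
    reward_gap: gap between reward scores; smaller is treated as harder.
    """
    data_hardness = 1.0 - similarity
    model_hardness = 1.0 / (1.0 + math.exp(reward_gap))  # sigmoid(-gap)
    return 1.0 + alpha * data_hardness + beta * model_hardness


def scaled_advantages(rewards, similarity, reward_gap):
    """Scale the group-relative advantages by the difficulty weight."""
    scale = difficulty_scale(similarity, reward_gap)
    return [scale * a for a in group_relative_advantages(rewards)]
```

Under these assumptions, a prompt with low encoder similarity and a small reward gap receives a larger weight, so its group contributes more strongly to the policy update than an easy prompt with the same raw rewards.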
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 2799