Keywords: rlhf, alignment, reward models, information theory, fairness, reward hacking, bias
TL;DR: We use adversarial training to debias RLHF reward models by minimizing mutual information between rewards and sensitive categories, achieving near-neutral bias while preserving alignment.
Abstract: Reward misspecification in RLHF threatens the reliability of large language models by amplifying spurious correlations and producing unstable or unsafe behavior. Expert-defined harm categories provide a stable signal for post-training evaluation, but reward models often encode categorical biases that undermine trustworthiness. We address this challenge through an information-theoretic reliability objective: minimizing mutual information between reward scores and sensitive categories. Our approach enforces invariance via adversarial training while integrating curiosity-driven intrinsic rewards into PPO to preserve diversity. Framing debiasing as a minimax game yields reward models that are both robust and verifiably category-independent. Empirically, our Fair-RM achieves near-neutral bias on CrowS-Pairs and StereoSet, reduces post-PPO disparity on HH-RLHF, and scales to 19-category fairness in PKU-SafeRLHF. These results demonstrate improved calibration and stability under distribution shift, establishing our method as a practical reliability control for safety-critical RLHF deployment.
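A minimal sketch of how the stated objective might be written, assuming a standard adversarial-debiasing formulation (the abstract does not give the exact loss): let $r_\theta(x)$ be the reward score, $c$ the sensitive category, $\mathcal{L}_{\mathrm{pref}}(\theta)$ the usual pairwise preference loss, $q_\phi$ an adversary that tries to recover $c$ from the reward score, and $\lambda$ a hypothetical trade-off weight not reported here. One common minimax form is

$$
\min_{\theta}\;\max_{\phi}\;\; \mathcal{L}_{\mathrm{pref}}(\theta)\;-\;\lambda\,\mathbb{E}_{(x,c)}\!\left[-\log q_\phi\!\left(c \mid r_\theta(x)\right)\right].
$$

If the adversary class is expressive enough, the optimal adversary's cross-entropy approximates $H(c \mid r_\theta(x))$, so pushing it up toward $H(c)$ drives the variational estimate of $I(r_\theta(x); c)$ toward zero, which is the category-independence property the abstract claims.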
Submission Number: 164