Keywords: Reward modeling, RLHF, Fairness, Information theory, Bias mitigation, Mutual information constraint, Adversarial debiasing, Reliability, Calibration, Parity gap metric, Category invariance
TL;DR: We introduce a fairness-constrained reward modeling framework that uses mutual information minimization and curiosity-driven exploration to reduce bias, improve reliability, and enhance evaluation of RLHF systems.
Abstract: Reward misspecification in RLHF threatens the reliability of large language models by amplifying spurious correlations and producing unstable or unsafe behavior. Expert-defined harm categories provide a stable signal for post-training evaluation, but reward models often encode categorical biases that undermine trustworthiness. We address this challenge with an information-theoretic reliability objective: minimizing the mutual information between reward scores and sensitive categories. Our approach enforces invariance via adversarial training while integrating curiosity-driven intrinsic rewards into PPO to preserve diversity. Framing debiasing as a minimax game yields reward models that are both robust and verifiably category-independent. Empirically, our Fair-RM achieves near-neutral bias on CrowS-Pairs and StereoSet, reduces post-PPO disparity on HH-RLHF, and scales to the 19 harm categories of PKU-SafeRLHF. These results demonstrate improved calibration and stability under distribution shift, establishing our method as a practical reliability control for safety-critical RLHF deployment.
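A minimal sketch of the objective described above, in assumed notation (the symbols r_theta, q_phi, c, lambda, and beta are illustrative and not taken from the paper): the reward model is trained against an adversarial classifier that tries to recover the sensitive category from the reward score, which serves as a variational surrogate for the mutual information term, and PPO then optimizes the debiased reward plus a curiosity bonus.

```latex
% Illustrative sketch only; notation (r_theta, q_phi, lambda, beta) is assumed, not the paper's.
% r_theta: reward model; q_phi: adversarial category classifier; c: sensitive category;
% L_RM: standard preference (e.g., pairwise ranking) loss; lambda, beta: trade-off weights.
\begin{equation}
  \min_{\theta}\,\max_{\phi}\;
    \mathcal{L}_{\mathrm{RM}}(\theta)
    \;+\; \lambda\,\mathbb{E}_{(x,c)}\!\left[\log q_{\phi}\big(c \mid r_{\theta}(x)\big)\right]
\end{equation}
% At the adversary's optimum, the penalty approaches -H(c | r_theta(x)) up to a constant,
% so minimizing it in theta pushes I(r_theta(x); c) = H(c) - H(c | r_theta(x)) toward zero.
% During PPO, the policy is assumed to receive the debiased reward plus an intrinsic bonus:
\begin{equation}
  r_{\mathrm{total}}(x, y) \;=\; r_{\theta}(x, y) \;+\; \beta\, r_{\mathrm{int}}(x, y)
\end{equation}
```

The minimax form matches the abstract's framing of debiasing as a game; the specific choice of preference loss and intrinsic-reward estimator is left open here, since the abstract does not pin them down.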
Submission Number: 152