Track: Ideas, Open Problems and Positions Track
Keywords: rlhf, alignment, reward models, info theory, fairness, reward hacking, bias
Abstract: Reward misspecification in RLHF creates a critical gap between theoretical RL guarantees and practical deployment, as empirical reward models amplify spurious correlations that violate theoretical alignment assumptions. Expert-defined harm categories provide ground truth for bridging this theory–practice divide, yet learned reward models often encode categorical biases that undermine convergence properties. We take the position that fairness constraints, operationalized as minimizing the mutual information between reward scores and sensitive categories, should be treated as a theoretical reliability principle for RLHF reward models. This framing translates invariance guarantees into an adversarial training objective while integrating curiosity-driven intrinsic rewards into PPO to preserve the exploration–exploitation balance. Our experiments show near-neutral bias on CrowS-Pairs and StereoSet, reduced post-PPO disparity on HH-RLHF, and improved fairness across 19 categories in PKU-SafeRLHF, demonstrating the feasibility of this approach. We conclude with open challenges in extending beyond discrete categories, analyzing reward-hacking dynamics, and scaling adversarial objectives to larger models, positioning fairness not as an auxiliary constraint but as a core bridge between theoretical RL desiderata and practical deployment.
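Illustrative sketch (not from the submission): a minimal PyTorch-style example of how the stated fairness constraint, minimizing mutual information between reward scores and sensitive harm categories, could be operationalized as adversarial reward-model training. All names, layer sizes, and the penalty weight are assumptions; the curiosity-driven intrinsic reward and PPO integration described in the abstract are omitted.

```python
# Sketch: adversarial proxy for minimizing I(reward score; sensitive category).
# Assumes a frozen encoder producing fixed-size embeddings and discrete
# category labels (19 categories, as in PKU-SafeRLHF). Hypothetical sizes.
import torch
import torch.nn as nn

EMB_DIM, N_CATEGORIES, LAMBDA_ADV = 768, 19, 0.5  # assumed hyperparameters

# Reward head scores an embedding; adversary tries to recover the sensitive
# category from the scalar reward score alone.
reward_head = nn.Sequential(nn.Linear(EMB_DIM, 256), nn.ReLU(), nn.Linear(256, 1))
adversary = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, N_CATEGORIES))

opt_reward = torch.optim.Adam(reward_head.parameters(), lr=1e-4)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-4)
ce = nn.CrossEntropyLoss()

def training_step(emb_chosen, emb_rejected, category):
    # 1) Adversary step: its achievable cross-entropy tracks how much
    #    category information the reward score leaks.
    with torch.no_grad():
        r_chosen = reward_head(emb_chosen)
    adv_loss = ce(adversary(r_chosen), category)
    opt_adv.zero_grad()
    adv_loss.backward()
    opt_adv.step()

    # 2) Reward step: Bradley-Terry preference loss plus a penalty that
    #    rewards scores the adversary cannot decode (maximize adversary error).
    r_chosen = reward_head(emb_chosen)
    r_rejected = reward_head(emb_rejected)
    pref_loss = -nn.functional.logsigmoid(r_chosen - r_rejected).mean()
    fairness_penalty = -ce(adversary(r_chosen), category)
    loss = pref_loss + LAMBDA_ADV * fairness_penalty
    opt_reward.zero_grad()
    loss.backward()
    opt_reward.step()
    return pref_loss.item(), adv_loss.item()

# Toy usage with random embeddings and labels.
emb_c, emb_r = torch.randn(8, EMB_DIM), torch.randn(8, EMB_DIM)
cats = torch.randint(0, N_CATEGORIES, (8,))
print(training_step(emb_c, emb_r, cats))
```

The two-player objective is one standard way to approximate a mutual-information constraint: at the reward model's optimum the adversary's best guess of the category from the score is no better than chance, which is the invariance property the abstract frames as a reliability principle.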
Submission Number: 119