Keywords: RLHF, reward hacking, preference learning, evaluations, reward modeling, theory
Abstract: Single-axis mitigations of reward-model biases (e.g., reducing reliance of the proxy reward on length, sycophancy, or style) can rotate
optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. We formalize
the underlying measurement-vs-optimization gap between the audit distributions where mitigations are validated and the policy
distributions where optimization realizes their effects. We introduce a taxonomy, instantiated in closed form, classifying single-axis
mitigation outcomes into successful mitigation, bias substitution, overcorrection, silent non-op, and audit-distribution sensitivity.
We prove that single-axis mitigation methods cannot be validated by audit-distribution-only evaluation: successful mitigation, bias
substitution, and overcorrection produce structurally identical observables under ranking accuracy and win-rate scoring, regardless of
benchmark richness. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap.
We give actionable prescriptions for mitigation methods and benchmarks. Across published preference-learning mitigation work,
no method we survey reports the evidence needed to certify successful mitigation. We demonstrate bias substitution in language
model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto
confidence calibration, driving the trained policy into overconfidence. Our experiments also show that a published length-debiasing
operator zeros the pooled reward–length correlation but flips its sign within prompts on three of four SOTA reward models, with true
reward degrading on two, and that length–sycophancy coupling reverses under human–LLM judge disagreement across eight model families.
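The pooled-versus-within-prompt distinction above can be illustrated with a small synthetic sketch. It does not use the published length-debiasing operator evaluated in the paper: `residualize_pooled` is a hypothetical stand-in that simply regresses length out of the reward over all responses pooled together, and the data, slopes, and noise levels are made-up assumptions chosen so that length varies more between prompts than within them. Under those assumptions, the pooled reward–length correlation goes to zero by construction while the within-prompt correlation changes sign.

```python
import numpy as np

def pooled_corr(x, y):
    """Pearson correlation over all (prompt, response) pairs pooled together."""
    return float(np.corrcoef(x.ravel(), y.ravel())[0, 1])

def within_prompt_corr(x, y):
    """Pearson correlation after subtracting each prompt's mean (rows = prompts)."""
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    return float(np.corrcoef(xc.ravel(), yc.ravel())[0, 1])

def residualize_pooled(reward, length):
    """Hypothetical stand-in debiaser: remove the pooled linear length effect."""
    l = length.ravel() - length.mean()
    r = reward.ravel() - reward.mean()
    beta = (l @ r) / (l @ l)  # pooled least-squares slope of reward on length
    return reward - beta * (length - length.mean())

# Synthetic data (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_prompts, n_resp = 200, 8
prompt_len = rng.normal(300, 100, size=(n_prompts, 1))  # between-prompt length variation
dev = rng.normal(0, 30, size=(n_prompts, n_resp))       # within-prompt length variation
length = prompt_len + dev

# Reward depends strongly on the prompt-level length and weakly on the deviation.
noise = rng.normal(0, 0.2, size=(n_prompts, n_resp))
reward = 0.02 * (prompt_len - 300) + 0.005 * dev + noise

debiased = residualize_pooled(reward, length)

print("pooled corr, before:", pooled_corr(reward, length))
print("pooled corr, after: ", pooled_corr(debiased, length))
print("within corr, before:", within_prompt_corr(reward, length))
print("within corr, after: ", within_prompt_corr(debiased, length))
```

The pooled slope is dominated by between-prompt variation, so subtracting it overshoots the smaller within-prompt slope. An audit that reports only the pooled statistic would record this as a successful mitigation.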
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 23