Reward Bias Substitution: Single-Axis Mitigations Shift Optimization Pressure

Max Lamparth; Daniel Fein; Andreas Haupt; Marcel Hussing; Mykel Kochenderfer

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ACM CAIS 2026: RLEval Workshop (Oral)
License: CC BY 4.0

Keywords: RLHF, reward hacking, preference learning, evaluations, reward modeling, theory

Abstract: Single-axis mitigations of reward-model biases (e.g., reducing the proxy reward's reliance on length, sycophancy, or style) can rotate optimization pressure onto correlated proxies rather than eliminate it, a failure mode we call reward bias substitution. We formalize the underlying measurement-vs-optimization gap between the audit distributions where mitigations are validated and the policy distributions where optimization realizes their effects. We introduce a taxonomy, instantiated in closed form, that classifies single-axis mitigation outcomes into successful mitigation, bias substitution, overcorrection, silent non-op, and audit-distribution sensitivity. We prove that single-axis mitigation methods cannot be validated by audit-distribution-only evaluation: successful mitigation, bias substitution, and overcorrection produce structurally identical observables under ranking accuracy and win-rate scoring, regardless of benchmark richness. Augmenting evaluation with policy-induced distributions while tracking multiple biases provably closes the gap. We give actionable prescriptions for mitigation methods and benchmarks. Across the published preference-learning mitigation work we survey, no method reports the evidence needed to certify successful mitigation. We demonstrate bias substitution in language model RLHF, where a length penalty during GRPO training compresses responses as intended yet redirects optimization pressure onto confidence calibration, driving the trained policy into overconfidence. Our experiments also show that a published length-debiasing operator zeros the pooled reward–length correlation but flips its sign within-prompt on three of four SOTA reward models, with true reward degrading on two, and that length–sycophancy coupling reverses under human–LLM judge disagreement across eight model families.
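
The distinction the abstract draws between pooled and within-prompt reward–length correlation can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic toy numbers, not results from the paper, and the helper `pearson` and the variable names are hypothetical. The toy set is built so that prompts with longer responses have higher average reward, while inside each prompt the longer response scores lower; the two effects cancel in the pooled statistic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Synthetic toy data (NOT from the paper): three prompts, three responses each.
# Across prompts, reward rises with the prompt's typical length;
# within a prompt, reward falls as the response gets longer.
samples = {}
for prompt_id, base_len in enumerate([8.0, 10.0, 12.0]):
    offsets = (-1.0, 0.0, 1.0)
    lengths = [base_len + d for d in offsets]
    rewards = [0.25 * (base_len - 10.0) - d for d in offsets]
    samples[prompt_id] = (lengths, rewards)

# Pooled statistic: ignore prompt identity and correlate over all responses.
all_lengths = [x for lengths, _ in samples.values() for x in lengths]
all_rewards = [y for _, rewards in samples.values() for y in rewards]
pooled = pearson(all_lengths, all_rewards)

# Within-prompt statistic: correlate separately inside each prompt.
within = {pid: pearson(ls, rs) for pid, (ls, rs) in samples.items()}

print("pooled reward-length correlation:", pooled)
print("within-prompt correlations:", within)
```

In this toy construction the pooled correlation vanishes while every within-prompt correlation is strongly negative, so an audit that reports only the pooled number would not reveal the within-prompt dependence on length.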

Submission Number: 23
