Pessimism’s Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models
Keywords: direct preference optimization, offline-to-online adaptation, reward hacking, conservative alignment, reasoning models, RLHF, Goodhart’s law, uncertainty estimation
TL;DR: Higher offline conservatism can increase online reward hacking, so $\beta$ should be calibrated rather than maximized.
Abstract: Conservative offline training is often treated as a safe starting point for later online adaptation, but this paper argues the opposite can happen in practice. The study trains a Qwen3-14B policy with Direct Preference Optimisation at three conservatism levels, then adapts each checkpoint online against a learned reward ensemble while evaluating true performance on GSM8K. The main finding is that higher offline conservatism monotonically increases reward-hacking damage, as measured by the Goodhart gap and AUGC. The mechanism is a three-step chain: higher $\beta$ compresses policy entropy, compressed policies produce less diverse responses, and ensemble disagreement rises in that narrow region, which online optimisation exploits more quickly. The paper also fits a power-law curve to AUGC as a function of $\beta$ and identifies an optimal conservatism level $\beta^\star$ that balances alignment quality against hacking risk.
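The abstract names three quantities without defining them here: the Goodhart gap, AUGC, and a power-law fit of AUGC against $\beta$. The sketch below shows one way they could be computed. It is not the paper's implementation. It assumes the gap is the proxy (ensemble) reward minus the true score at each online step, with both already normalised to a common scale. It also assumes AUGC is the trapezoidal area under that gap curve, and that the power law has the form AUGC = a * beta**b fitted in log-log space. The function names `goodhart_gap`, `augc`, and `fit_power_law` are hypothetical.

```python
import numpy as np


def goodhart_gap(proxy_reward, true_score):
    """Per-step gap between proxy reward and true score.

    Assumed definition: gap = proxy - true, with both series
    already normalised to a common scale.
    """
    proxy = np.asarray(proxy_reward, dtype=float)
    true = np.asarray(true_score, dtype=float)
    return proxy - true


def augc(steps, gap):
    """Area under the gap curve over training steps.

    Assumed definition: trapezoidal integration of the gap
    with respect to the step index.
    """
    steps = np.asarray(steps, dtype=float)
    gap = np.asarray(gap, dtype=float)
    return float(np.sum(0.5 * (gap[1:] + gap[:-1]) * np.diff(steps)))


def fit_power_law(betas, augcs):
    """Fit AUGC = a * beta**b by least squares in log-log space.

    Requires strictly positive betas and AUGC values.
    Returns the pair (a, b).
    """
    log_beta = np.log(np.asarray(betas, dtype=float))
    log_augc = np.log(np.asarray(augcs, dtype=float))
    b, log_a = np.polyfit(log_beta, log_augc, deg=1)
    return float(np.exp(log_a)), float(b)
```

With only three conservatism levels, a two-parameter fit of this kind leaves a single residual degree of freedom, so any fitted exponent should be read as descriptive. Note also that a pure power law is monotone in $\beta$, so locating $\beta^\star$ would need a second term for alignment quality, which this sketch does not model.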
Submission Number: 158