Pessimism’s Paradox: Conservative Offline Training Amplifies Reward Hacking During Online Adaptation in Reasoning Models.

Subramanyam Sahoo; Aman Chadha; Vinija Jain; Divya Chaudhary

Published: 25 May 2026, Last Modified: 27 May 2026 | DEMO 2026 Poster | CC BY 4.0

Keywords: direct preference optimization, offline-to-online adaptation, reward hacking, conservative alignment, reasoning models, RLHF, Goodhart’s law, uncertainty estimation

TL;DR: Higher offline conservatism can increase online reward hacking, so $\beta$ should be calibrated rather than maximized.

Abstract: Conservative offline training is often treated as a safe starting point for later online adaptation, but this paper argues the opposite can happen in practice. The study trains a Qwen3-14B policy with Direct Preference Optimisation at three conservatism levels, then adapts each checkpoint online against a learned reward ensemble while evaluating true performance on GSM8K. The main finding is that higher offline conservatism monotonically increases reward-hacking damage, as measured by the Goodhart gap and AUGC. The mechanism is a three-step chain: higher $\beta$ compresses policy entropy, compressed policies produce less diverse responses, and ensemble disagreement rises in that narrow region, which online optimisation exploits more quickly. The paper also fits a power-law curve to AUGC as a function of $\beta$ and identifies an optimal conservatism level $\beta^\star$ that balances alignment quality against hacking risk.
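The abstract names its metrics without defining them, so the following is only a minimal sketch under stated assumptions: the Goodhart gap is taken to be the proxy (ensemble) reward minus the true score at each online step, AUGC is taken to be the trapezoidal area under that gap curve, and ensemble disagreement is taken to be the standard deviation of the ensemble members' rewards for one response. The function names and the toy numbers are hypothetical and are not results from the paper; both curves are assumed to be normalised to the same scale.

```python
from statistics import pstdev


def goodhart_gap(proxy_rewards, true_scores):
    """Per-step gap between proxy reward and true score (assumed definition)."""
    if len(proxy_rewards) != len(true_scores):
        raise ValueError("curves must have the same length")
    return [p - t for p, t in zip(proxy_rewards, true_scores)]


def augc(steps, gaps):
    """Trapezoidal area under the gap curve over online steps (assumed definition)."""
    if len(steps) != len(gaps):
        raise ValueError("steps and gaps must have the same length")
    area = 0.0
    for i in range(1, len(steps)):
        width = steps[i] - steps[i - 1]
        area += 0.5 * (gaps[i] + gaps[i - 1]) * width
    return area


def ensemble_disagreement(member_rewards):
    """Population standard deviation of ensemble rewards for a single response."""
    return pstdev(member_rewards)


# Illustrative toy curves only: proxy reward keeps rising while true accuracy stalls.
steps = [0, 100, 200]
proxy = [0.50, 0.70, 0.90]
true = [0.50, 0.55, 0.50]

gaps = goodhart_gap(proxy, true)
print("gaps:", gaps)
print("AUGC:", augc(steps, gaps))
print("disagreement:", ensemble_disagreement([0.0, 2.0]))
```

Under these assumed definitions, a checkpoint whose proxy reward climbs while its GSM8K accuracy stalls accumulates a larger AUGC, which is the kind of damage the abstract reports growing with $\beta$.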

Submission Number: 158
