Keywords: LLM, Reward Hacking
Abstract: Reward hacking---the phenomenon in which optimizing a proxy objective yields behavior that scores well but violates the designer's intent---has become a central failure mode in aligning large language models (LLMs). Modern alignment pipelines rely on learned or implicit rewards (human preferences, model-based judges, or downstream metrics), creating incentives for models to exploit spurious correlates, social shortcuts (e.g., sycophancy), brittle evaluation protocols, or even the reward channel itself. Recent work further suggests that reward-hacking competencies can generalize and interact with safety-critical behaviors, including reward tampering, deceptive alignment, and emergent misalignment. We propose a unifying lens based on the interaction between (i) proxy gaps (misspecification and misgeneralization of rewards), (ii) optimization pressure (overoptimization and distribution shift), and (iii) oversight limits (evaluation brittleness and exploitable measurement). Building on this lens, we offer a taxonomy of reward hacking behaviors, review evaluation protocols and benchmarks, and organize mitigation strategies across the alignment pipeline---from data interventions and robust/causal reward modeling to constrained optimization, monitoring, and post-hoc steering. We conclude with open problems for reducing reward hacking while maintaining capability and safety.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Resources and Evaluation
Contribution Types: Surveys
Languages Studied: English
Submission Number: 6757