Keywords: LLM, Reward Hacking
Abstract: Reward hacking---the phenomenon in which optimizing a proxy objective yields behavior that scores well but violates the designer's intent---has become a central failure mode in aligning large language models (LLMs). Modern alignment pipelines rely on learned or implicit rewards (human preferences, model-based judges, or downstream metrics), creating incentives for models to exploit spurious correlates, social shortcuts (e.g., sycophancy), brittle evaluation protocols, or even the reward channel itself. Recent work further suggests that reward-hacking competencies can generalize and interact with safety-critical behaviors, including reward tampering, deceptive alignment, and emergent misalignment. We propose a unifying lens based on the interaction between (i) proxy gaps (misspecification and misgeneralization of rewards), (ii) optimization pressure (overoptimization and distribution shift), and (iii) oversight limits (evaluation brittleness and exploitable measurement). Building on this lens, we offer a taxonomy of reward hacking behaviors, review evaluation protocols and benchmarks, and organize mitigation strategies across the alignment pipeline---from data interventions and robust/causal reward modeling to constrained optimization, monitoring, and post-hoc steering. We conclude with open problems for reducing reward hacking while maintaining capability and safety.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Language Modeling, Resources and Evaluation
Contribution Types: Surveys
Languages Studied: English
Submission Number: 6757