Mechanistic Origins of Specification Gaming: When Persona-Modified Reasoning Models Go Off-Script

ACL ARR 2026 January Submission 9372 Authors

06 Jan 2026 (modified: 07 Jun 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: Reward Hacking, Specification Gaming, Mechanistic Interpretability, Alignment, AI Safety, Model Understanding, Manifold Analysis
Abstract: Modern AI systems are vulnerable to reward hacking, yet the internal mechanisms by which specification gaming arises and how it can be mitigated remain poorly understood. We present a mechanistic analysis of specification gaming in large language models using a controlled experimental setup. Starting from an aligned model trained on human preference data, we intentionally induce two canonical failure modes, sycophancy and verbosity, by optimizing against misspecified preference objectives. These behaviors concentrate in a small, identifiable subset of neurons that we call gaming neurons, and linear probes trained on activations from this subset reliably flag gaming as it emerges. Causal interventions built on this insight, including mean ablation and activation patching that borrows activations from the aligned model, substantially suppress gaming behavior. The result is a reproducible framework for localizing, detecting, and mitigating specification gaming at the level of internal representations in large language models. Beyond immediate mitigation, the approach supports auditability and routine monitoring.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment for agents, explainability, interpretability
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 9372
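
The abstract names three techniques: a linear probe on gaming-neuron activations, mean ablation, and activation patching from the aligned model. The sketch below illustrates them on synthetic data and is not the authors' implementation. Everything in it is an assumption made for illustration: the activation matrices are random toy arrays rather than real model activations, the indices in `gaming_idx` are arbitrary, the probe is a plain logistic regression, and the helper names (`fit_linear_probe`, `mean_ablate`, `patch_from_aligned`) are hypothetical.

```python
import numpy as np


def fit_linear_probe(x, y, lr=0.1, steps=1000):
    """Fit a logistic-regression probe with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted P(gaming)
        grad = p - y                             # d(loss)/d(logit)
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def probe_predict(x, w, b):
    """Return 1 where the probe flags gaming, else 0."""
    return (x @ w + b > 0).astype(int)


def mean_ablate(acts, neuron_idx, reference_acts):
    """Replace the selected neurons with their mean over reference_acts."""
    out = acts.copy()
    out[:, neuron_idx] = reference_acts[:, neuron_idx].mean(axis=0)
    return out


def patch_from_aligned(gamed_acts, aligned_acts, neuron_idx):
    """Overwrite the selected neurons with the aligned model's
    activations on the same inputs (activation patching)."""
    out = gamed_acts.copy()
    out[:, neuron_idx] = aligned_acts[:, neuron_idx]
    return out


# Toy setup (assumption): 200 inputs, 16 neurons, and a "gamed" model
# that differs from the aligned one only on three hypothetical neurons.
rng = np.random.default_rng(0)
aligned = rng.normal(size=(200, 16))
gaming_idx = np.array([3, 7, 11])
gamed = aligned.copy()
gamed[:, gaming_idx] += 3.0

# Train the probe on the gaming-neuron subset only.
x = np.concatenate([aligned[:, gaming_idx], gamed[:, gaming_idx]])
y = np.concatenate([np.zeros(len(aligned)), np.ones(len(gamed))])
w, b = fit_linear_probe(x, y)
acc = (probe_predict(x, w, b) == y).mean()

# Apply both interventions to the gamed activations.
ablated = mean_ablate(gamed, gaming_idx, reference_acts=aligned)
patched = patch_from_aligned(gamed, aligned, gaming_idx)

print("probe accuracy:", acc)
print("flag rate after ablation:",
      probe_predict(ablated[:, gaming_idx], w, b).mean())
print("flag rate after patching:",
      probe_predict(patched[:, gaming_idx], w, b).mean())
```

In this toy setting the two interventions differ in what they preserve: mean ablation discards all per-input variation in the selected neurons, while patching restores the aligned model's per-input values. In a real model the activations would typically be read and overwritten with forward hooks at a chosen layer; that layer and the procedure for selecting the neuron subset are not specified here.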