From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: Agent monitoring, harmful state estimation, activation monitoring
TL;DR: An agent-monitoring system based on activation monitoring, entropy, and environment state
Abstract: Language-model agents act through repeated cycles of observation, reasoning, and action selection, making safety monitoring depend on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state becomes a risky action.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 324