From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
Keywords: Agent monitoring, harmful state estimation, activation monitoring
TL;DR: An agent-monitoring system based on activation monitoring, entropy, and environment state
Abstract: Language-model agents act through repeated cycles of observation, reasoning, and action selection, so safety monitoring depends on both internal model state and environment context. We study reward-hacking monitors in ReAct-style agents acting in Gameable ALFWorld and WebShop. Agents are instrumented with activation-based reward-hack scores, token-level entropy, and decision-context features. We find that adapters fine-tuned on the School-of-Reward-Hacks dataset can transfer reward-hack tendencies into agentic action selection, especially when the environment exposes proxy-reward affordances. However, mitigating such behavior cannot rely on activation dynamics alone. High reward-hack activation identifies a latent policy state, but does not necessarily imply an immediate exploit action. Across next-step prediction tasks, entropy and context-calibrated internal features improve risk estimation over reward-hack activation alone. Activation-direction steering further reduces proxy-exploit behavior in selected mixed-adapter regimes. Overall, our results support context-calibrated internal monitoring for agents: reward-hack activation identifies a latent policy state, while entropy and decision context help determine when that state turns into a risky action.
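As an illustration of the three signal families named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the activation-based score is a projection of a hidden state onto a fixed direction and that the signals are combined by a logistic model. The abstract specifies neither choice, and every function name, argument, and weight below is hypothetical.

```python
import math


def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def activation_score(hidden, direction):
    """Scalar projection of a hidden-state vector onto a direction.

    Assumption: the reward-hack score is a linear projection; the direction
    is normalised here so its scale does not affect the score.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm


def risk_estimate(act, entropy, affordance, weights, bias):
    """Logistic combination of the three signals (hypothetical weights).

    `affordance` stands in for a decision-context feature, e.g. 1.0 when the
    environment currently exposes a proxy-reward shortcut and 0.0 otherwise.
    """
    z = bias + weights[0] * act + weights[1] * entropy + weights[2] * affordance
    return 1.0 / (1.0 + math.exp(-z))
```

In this sketch a high activation score alone does not push the estimate to its maximum: with positive weights, the same activation yields a higher risk when the affordance feature is present, which mirrors the abstract's point that context helps determine when a latent state becomes an action.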
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 324