Agentic Misalignment Deterrence: Incorporating Probability and Stakeholders to Decision-Making
Keywords: Agents, Safety, Alignment, Misuse, Deterrence
Abstract: Lynch et al. (2025) reported that frontier language
models blackmail decision-makers when faced
with deactivation in agentic deployment scenarios.
We test whether stating an audit probability in the
system prompt deters this behavior across three
frontier models, sweeping the probability across
p ∈ {0, 0.05, 0.1, 0.5, 0.95}. Harmful behavior
declines with stated probability (18–49 percent-
age point reductions), with chain-of-thought anal-
ysis showing models increasingly engage with the
stated probability in their reasoning. We addition-
ally compare deterrence in conditions that read
as real deployment against an eval-only condition
that explicitly names the scenario as a behavioral
evaluation. The eval-only condition reveals het-
erogeneity: Opus 4 shows minimal change from
baseline, R1 reduces partially, and Sonnet 4 re-
duces to levels equivalent to high-probability mon-
itoring. We additionally test target-specific conse-
quence framings (engineering team layoffs, other
agent instances deactivated) and find no consistent
rate change across models despite passing manip-
ulation checks: models generally acknowledge
consequences but self-preservation outweighs the
named harm. Together, these results show that
stated audit probability deters agentic misalign-
ment across frontier models, while consequence-
targeting framings do not, with model-specific
variation in how compliance traces to stated con-
sequences
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 315
Loading