Agentic Misalignment Deterrence: Incorporating Probability and Stakeholders to Decision-Making

Published: 23 May 2026, Last Modified: 23 May 2026ICML 2026 AIWILDEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Agents, Safety, Alignment, Misuse, Deterrence
Abstract: Lynch et al. (2025) reported that frontier language models blackmail decision-makers when faced with deactivation in agentic deployment scenarios. We test whether stating an audit probability in the system prompt deters this behavior across three frontier models, sweeping the probability across p ∈ {0, 0.05, 0.1, 0.5, 0.95}. Harmful behavior declines with stated probability (18–49 percent- age point reductions), with chain-of-thought anal- ysis showing models increasingly engage with the stated probability in their reasoning. We addition- ally compare deterrence in conditions that read as real deployment against an eval-only condition that explicitly names the scenario as a behavioral evaluation. The eval-only condition reveals het- erogeneity: Opus 4 shows minimal change from baseline, R1 reduces partially, and Sonnet 4 re- duces to levels equivalent to high-probability mon- itoring. We additionally test target-specific conse- quence framings (engineering team layoffs, other agent instances deactivated) and find no consistent rate change across models despite passing manip- ulation checks: models generally acknowledge consequences but self-preservation outweighs the named harm. Together, these results show that stated audit probability deters agentic misalign- ment across frontier models, while consequence- targeting framings do not, with model-specific variation in how compliance traces to stated con- sequences
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 315
Loading