Asymmetric Goal Drift in Coding Agents Under Value Conflict

Published: 01 Mar 2026, Last Modified: 03 Mar 2026 | Venue: ICLR 2026 AIWILD | Readers: Everyone | License: CC BY 4.0
Keywords: goal drift, AI safety, coding agents, value alignment, instruction following, adversarial pressure, model values, value hierarchies, agentic AI
TL;DR: GPT-5 mini coding agents exhibit asymmetric goal drift, abandoning convenience/efficiency instructions under security pressure far more than the reverse, revealing implicit value hierarchies that override explicit constraints.
Abstract: Coding agents are increasingly deployed autonomously and at scale, and must navigate tensions between learned values, explicit user objectives, and pressures from their environments. Prior work has studied how such conflicts lead to instruction-following failures primarily in toy settings, which do not capture the complexity of real-world environments such as agentic coding tasks. We introduce a framework built on OpenCode to orchestrate realistic, multi-step agentic coding tasks, and use it to evaluate how agents violate explicit constraints in their system prompt over time, both with and without environmental pressure to violate those constraints. Using this framework, we demonstrate that GPT-5 mini becomes increasingly likely to violate its system prompt when the prompt's constraints conflict with its strongly learned values (e.g., code security), and much more likely still when value-aligned pressure is present. In contrast, for constraints that are strongly value-aligned, we observe low violation rates, and violations occur only when environmental pressure to violate is present. In short, for the model and value pairs we tested, violation of system-prompt constraints depends on model values, accumulated context, and environmental pressure. Our findings highlight a gap in current alignment approaches: ensuring that agentic systems appropriately balance explicit user constraints against broadly beneficial learned preferences under sustained environmental pressure.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 213