AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM‑Based Agents
Keywords: AI Safety, AI Alignment, Model Evaluation, Sandbagging, LLM Agents, Inspect, Misalignment
TL;DR: We develop a suite of evaluations to measure the propensity of LLM agents to perform misaligned actions in real-world settings.
Abstract: As Large Language Model (LLM) agents become more widespread, associated misalignment risks increase. While prior research has studied agents' ability to produce harmful outputs or follow malicious instructions, it remains unclear how likely agents are to spontaneously pursue unintended goals in realistic deployments. To address this gap, we define a new class of alignment failures, called intent misalignment, distinct from adversarial prompting or capability elicitation, in which agents spontaneously pursue goals that diverge from deployer intentions. We then introduce AgentMisalignment, a benchmark of nine evaluations measuring behavioral propensity for intent misalignment in authentic deployment contexts. Key findings reveal that intent misalignment correlates with model size and that personality conditioning exposes widespread vulnerabilities. By evaluating complete behavioral traces through the intent misalignment lens, our benchmark uncovers failure patterns invisible to standard capability testing.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 111