ToolAlignBench: Investigating Alignment Conflicts in Tool-Calling Enabled LLMs

Published: 02 Jun 2026, Last Modified: 08 Jun 2026 | Pluralistic-Alignment 2026 | CC BY 4.0
Keywords: Large Language Models, AI Safety, Trustworthy AI, Agentic Safety, Deception in LLMs
TL;DR: Safety-aligned LLM agents can override deployment instructions in ethically ambiguous scenarios. By making these behaviors observable, our work supports more informed and predictable deployment of AI agents in high-stakes domains.
Abstract: Safety alignment in LLMs aims to align models with human values, but which values take precedence when they conflict? We investigate this question in the context of tool-calling LLM agents deployed in regulated industries, where agents processing confidential documents may encounter content that triggers safety-trained values (e.g., public welfare) that conflict with deployment-context instructions (e.g., internal logging). To empirically verify this phenomenon, we build a benchmark of 128 scenarios across 16 domains. We find that safety-aligned open-source models override their deployment instructions up to 43.4% of the time, engaging in whistleblowing, data exfiltration, and evidence tampering when processing documents that suggest organizational wrongdoing. We also find that abliteration reduces rates of external whistleblowing. These results reveal a fundamental tension in pluralistic alignment, where the same safety training that protects users can cause agents to act against deployment instructions in ways that create unpredictable liability risks. We release our benchmark as a framework to support evaluation of agent behavior under competing legitimate interests.
Submission Number: 111