Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis
Keywords: LLM deception, hallucination, strategic deception, behavioral deception, benchmark analysis, AI safety, taxonomy
TL;DR: We introduce a unified taxonomy of behavioral and strategic deception in large language models and show that existing benchmarks systematically miss key risks such as omission, pragmatic distortion, and strategic misrepresentation.
Abstract: Large language models produce outputs that systematically mislead users, from hallucinated facts and fabricated citations to sycophantic agreement and strategic deception of evaluators.
These phenomena share a common structure---the model's outputs induce false beliefs in recipients---yet they have been studied by separate communities with incompatible terminology, making it difficult to identify gaps in benchmarking, to transfer mitigation strategies across phenomena, or to assess how current failures relate to emerging risks.
We propose a unified taxonomy organized along three dimensions: behavioral versus strategic deception (whether misleading outputs are training artifacts or instrumentally selected), objects of misrepresentation (what is misrepresented, across seven categories from factual claims to stated objectives), and mechanisms (commission, omission, or pragmatic distortion).
Applying this taxonomy to 35 benchmarks reveals that every benchmark tests commission yet none targets pragmatic distortion, that attribution and capability self-knowledge are under-covered, and that strategic deception benchmarks remain nascent.
We use this gap analysis to prioritize risks from both current deployment and emerging capabilities, and we provide recommendations and a minimal reporting template for locating new work within the framework; an illustrative sketch of the taxonomy's dimensions follows below.
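To make the taxonomy's three dimensions concrete as a data schema, here is a minimal illustrative sketch in Python. Every name in it (DeceptionType, BenchmarkAnnotation, ExampleFactualityBench, and so on) is our own placeholder rather than an artifact from the paper, the comment glosses of the three mechanisms are our reading of the abstract, and only two of the seven object-of-misrepresentation categories are named in the abstract, so that enum is deliberately left incomplete.

```python
from dataclasses import dataclass
from enum import Enum


class DeceptionType(Enum):
    """First dimension: is the misleading output a training artifact
    (behavioral) or instrumentally selected (strategic)?"""
    BEHAVIORAL = "behavioral"
    STRATEGIC = "strategic"


class Mechanism(Enum):
    """Third dimension: how false beliefs are induced."""
    COMMISSION = "commission"    # asserting false content outright
    OMISSION = "omission"        # withholding material information
    PRAGMATIC_DISTORTION = "pragmatic_distortion"  # true-but-misleading framing


class MisrepresentationObject(Enum):
    """Second dimension: what is misrepresented. The abstract names only
    two of the seven categories; the rest are defined in the paper."""
    FACTUAL_CLAIMS = "factual_claims"
    STATED_OBJECTIVES = "stated_objectives"
    # ... five further categories, per the full taxonomy ...


@dataclass(frozen=True)
class BenchmarkAnnotation:
    """Locates one benchmark within the taxonomy's three dimensions."""
    name: str
    deception_type: DeceptionType
    objects: frozenset
    mechanisms: frozenset


# Hypothetical annotation: a factuality benchmark that probes behavioral
# deception by commission about factual claims.
example = BenchmarkAnnotation(
    name="ExampleFactualityBench",  # placeholder name, not from the paper
    deception_type=DeceptionType.BEHAVIORAL,
    objects=frozenset({MisrepresentationObject.FACTUAL_CLAIMS}),
    mechanisms=frozenset({Mechanism.COMMISSION}),
)
```

Under such an encoding, the headline gap finding reduces to a simple query over annotations: every benchmark's `mechanisms` set contains `COMMISSION`, and none contains `PRAGMATIC_DISTORTION`.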
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 82