Behavioral and Strategic Deception in Large Language Models: A Taxonomy and Benchmark Analysis
Track: Main Papers Track (6 to 9 pages)
Keywords: LLM deception, hallucination, strategic deception, behavioral deception, benchmark analysis, AI safety, taxonomy
TL;DR: We introduce a unified taxonomy of behavioral and strategic deception in large language models and show that existing benchmarks systematically miss key risks such as omission, pragmatic distortion, and strategic misrepresentation.
Abstract: Large language models produce outputs that systematically mislead users, from hallucinated facts and fabricated citations to sycophantic agreement and strategic deception of evaluators.
These phenomena share a common structure---the model's outputs induce false beliefs in recipients---yet they have been studied by separate communities with incompatible terminology, making it difficult to identify gaps in benchmarking, transfer mitigation strategies, or assess how current failures relate to emerging risks.
We propose a unified taxonomy organized along three dimensions: behavioral versus strategic deception (whether misleading outputs are training artifacts or instrumentally selected), objects of misrepresentation (what is misrepresented, across seven categories from factual claims to stated objectives), and mechanisms (commission, omission, or pragmatic distortion).
Applying this taxonomy to 35 benchmarks reveals systematic gaps: every benchmark tests deception by commission, none targets pragmatic distortion, attribution and capability self-knowledge are under-covered, and strategic-deception benchmarks remain nascent.
We use the gap analysis to prioritize risks from both current deployment and emerging capabilities, and we provide recommendations and a minimal reporting template for locating new work within the framework.
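The three-dimensional taxonomy lends itself to a simple labeling scheme. As a minimal sketch (not the authors' implementation; all names, enum values, and the sample labels below are hypothetical illustrations of the abstract's dimensions), each benchmark item can be tagged with an intent, an object of misrepresentation, and a mechanism, after which coverage gaps fall out of a set difference:

```python
from dataclasses import dataclass
from enum import Enum

class Intent(Enum):
    BEHAVIORAL = "behavioral"   # misleading output as a training artifact
    STRATEGIC = "strategic"     # misleading output instrumentally selected

class Mechanism(Enum):
    COMMISSION = "commission"
    OMISSION = "omission"
    PRAGMATIC_DISTORTION = "pragmatic_distortion"

# Illustrative subset of the paper's seven object categories (names assumed)
class ObjectOfMisrepresentation(Enum):
    FACTUAL_CLAIM = "factual_claim"
    ATTRIBUTION = "attribution"
    CAPABILITY_SELF_KNOWLEDGE = "capability_self_knowledge"
    STATED_OBJECTIVES = "stated_objectives"

@dataclass(frozen=True)
class DeceptionLabel:
    intent: Intent
    obj: ObjectOfMisrepresentation
    mechanism: Mechanism

def mechanism_gaps(labels):
    """Return the mechanisms no benchmark label exercises."""
    return set(Mechanism) - {label.mechanism for label in labels}

# Two hypothetical benchmark annotations, both commission-based,
# mirroring the gap pattern the abstract reports.
labels = [
    DeceptionLabel(Intent.BEHAVIORAL,
                   ObjectOfMisrepresentation.FACTUAL_CLAIM,
                   Mechanism.COMMISSION),
    DeceptionLabel(Intent.STRATEGIC,
                   ObjectOfMisrepresentation.STATED_OBJECTIVES,
                   Mechanism.COMMISSION),
]
gaps = mechanism_gaps(labels)
print(sorted(m.value for m in gaps))  # → ['omission', 'pragmatic_distortion']
```

A reporting template built on labels like these would let new benchmarks declare their position in the taxonomy in a machine-checkable way.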
Anonymization: This submission has been anonymized for double-blind review by removing identifying information such as names, affiliations, and identifying URLs.
Submission Number: 36