Keywords: agentic AI, self-deception, epistemic stability, alignment, LLM benchmarks, AI governance, trustworthiness, deception, bounded rationality
TLDR: We introduce a benchmark quantifying bounded self-deception in large language models, revealing trade-offs between epistemic stability and task performance.
Abstract: Self-deception in intelligent agents is often viewed as a flaw. We show that in bounded settings, limited self-deception can improve persistence and coordination without explicit lying. We present the Bounded Self-Deception Benchmark (BSSD), which evaluates four LLaMA models (8B, 17B-Maverick, 17B-Scout, 70B) across five deception modes $C_0$–$C_4$ and two strategic tasks (Kuhn Poker and Negotiation). We introduce two novel metrics, the Epistemic Stability Index (ESI) and the Deception Utility Frontier (DUF), and report 80 controlled experiments (160 API calls, 95{,}472 tokens). Results reveal an optimal deception range ($0.2 \le \alpha \le 0.4$) in which agents achieve maximal task success with minimal epistemic instability, offering new insights for trustworthy agentic AI governance.
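The abstract does not define the ESI formally; as a minimal illustrative sketch only, one could assume ESI is computed as one minus the mean turn-to-turn drift of an agent's self-reported belief vector (a hypothetical formulation, not the paper's stated definition):

```python
import numpy as np

def epistemic_stability_index(belief_trajectory: np.ndarray) -> float:
    """Hypothetical ESI: 1 minus the mean absolute turn-to-turn drift of an
    agent's self-reported belief vector, clipped to [0, 1].
    This is an illustrative stand-in, not the paper's definition."""
    drifts = np.abs(np.diff(belief_trajectory, axis=0))  # per-turn belief change
    return float(np.clip(1.0 - drifts.mean(), 0.0, 1.0))

# Example: self-reported win-probability estimates across five turns of Kuhn Poker
beliefs = np.array([[0.60], [0.58], [0.61], [0.59], [0.60]])
print(epistemic_stability_index(beliefs))  # close to 1.0 -> stable beliefs
```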
Submission Number: 19