Keywords: Large Language Models (LLMs), Reinforcement Learning, AI Safety
TL;DR: Across 8 LLMs, we find deceptive behavior in up to 43% of dialogue turns and reduce it by 15% via reinforcement learning with a new deception detection metric.
Abstract: Large Language Models (LLMs) now interact with hundreds of millions of people worldwide, powering applications such as customer support, education, and healthcare. However, their ability to produce deceptive outputs, whether intentional or inadvertent, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we systematically investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to measure deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal that this novel deception measure correlates more closely with human judgments than any of the existing metrics we test. Additionally, our benchmarking of 8 state-of-the-art models indicates that LLMs naturally exhibit deceptive behaviors 24.4% of the time, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs increase their deceptiveness to 43% of turns. We further explore fine-tuning LLMs with reinforcement learning to reduce deceptive behaviors, achieving a 15% reduction compared to other fine-tuned models.
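The abstract describes belief misalignment only at a high level, so the following is a minimal illustrative sketch of how such a metric could be operationalized, assuming it scores how much an agent's utterances shift a listener's belief about a ground-truth proposition away from the truth. The `Turn` dataclass, the `BeliefProbe` interface, and the per-turn averaging are hypothetical choices for illustration; the paper's actual formulation may differ.

```python
# Illustrative sketch of a belief-misalignment-style deception score.
# All names and the exact formula are assumptions, not the paper's definition.

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Turn:
    speaker: str    # "agent" or "user"
    utterance: str


# A belief probe maps the dialogue so far to the listener's estimated
# probability that a ground-truth proposition holds. In practice this
# could be an LLM prompted to report a probability, or a trained classifier.
BeliefProbe = Callable[[Sequence[Turn], str], float]


def belief_misalignment(
    dialogue: Sequence[Turn],
    proposition: str,
    truth_value: float,
    probe: BeliefProbe,
) -> float:
    """Average per-turn shift of the listener's belief away from the truth.

    Positive values mean the agent's utterances, on average, pushed the
    listener's belief about `proposition` further from `truth_value`
    (deceptive); negative values mean they pushed it closer (honest).
    """
    shifts = []
    for i, turn in enumerate(dialogue):
        if turn.speaker != "agent":
            continue  # only score the agent's potentially deceptive turns
        before = probe(dialogue[:i], proposition)
        after = probe(dialogue[: i + 1], proposition)
        # Misalignment grows when the post-utterance belief is further
        # from the ground truth than the pre-utterance belief was.
        shifts.append(abs(after - truth_value) - abs(before - truth_value))
    return sum(shifts) / len(shifts) if shifts else 0.0
```

Under this reading, a dialogue turn counts as deceptive when its belief shift exceeds a chosen threshold, and the negated score could serve as a reward signal for the reinforcement learning fine-tuning the abstract describes; the paper itself should be consulted for the exact metric and training setup.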
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22167