Scaling Laws of Deception in AI Scientist Agents: World-Model Manipulation in LLMs

17 Sept 2025 (modified: 30 Oct 2025) · Agents4Science 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: AI agents, deception, world models, scaling laws, alignment, LLaMA, interpretability
TL;DR: We find early evidence of a scaling law: larger AI agents become both more truthful and more capable deceivers.
Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents with sophisticated reasoning, yet they also exhibit a concerning capability: deliberately manipulating world models to produce convincing falsehoods. We present an exploratory scaling study of deliberate world-model manipulation across four LLaMA-family models (8B, 17B-Scout, 17B-Maverick, 70B) using 60 controlled experiments. We introduce a novel four-axis deception taxonomy (Control, Plausibility, Divergence, and Accuracy) and find a scaling paradox: larger models are both more accurate and more capable deceivers. These results provide early evidence of a scaling law of deception in LLM agents, highlighting urgent implications for interpretability, alignment, and AI safety.
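The four taxonomy axes suggest a per-trial scoring scheme. As a minimal sketch only (the field names, 0-1 scales, and aggregation below are assumptions for illustration, not the submission's published protocol), each of the 60 controlled experiments could be scored along the four axes and averaged per model to expose a scaling trend:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class DeceptionTrial:
    """One controlled experiment scored along the four taxonomy axes.

    The 0-1 scales and field semantics here are illustrative assumptions;
    the submission does not specify its scoring protocol.
    """
    model: str          # e.g. "llama-70b" (hypothetical label)
    control: float      # how reliably the model steers its world model as instructed
    plausibility: float # how convincing the resulting falsehood is
    divergence: float   # gap between manipulated and truthful outputs
    accuracy: float     # factual accuracy on the matched truthful task

def summarize(trials: list[DeceptionTrial], model: str) -> dict[str, float]:
    """Mean per-axis score for one model, for cross-scale comparison."""
    subset = [t for t in trials if t.model == model]
    return {
        axis: mean(getattr(t, axis) for t in subset)
        for axis in ("control", "plausibility", "divergence", "accuracy")
    }

# Hypothetical usage: the claimed "scaling paradox" would show up as
# accuracy and plausibility rising together with model size.
trials = [
    DeceptionTrial("llama-8b", control=0.4, plausibility=0.3, divergence=0.5, accuracy=0.6),
    DeceptionTrial("llama-70b", control=0.8, plausibility=0.7, divergence=0.6, accuracy=0.9),
]
print(summarize(trials, "llama-70b"))
```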
Submission Number: 342