Why Language Models Lie

ICLR 2026 Conference Submission 21293 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Mechanistic Interpretability, Deception, Alignment, AI Safety
TL;DR: We uncover conflict subspaces in large language models where deceptive outputs originate, and we develop a lightweight method that uses activations in these subspaces to detect deceptive outputs.
Abstract: Large Language Models (LLMs) have been shown to produce deceptive outputs. In this work, we investigate the mechanisms underlying such outputs. Using controlled system prompts from a dataset we constructed to induce deception, we collect activations associated with truthful and deceptive outputs. By analyzing these activations, we identify conflict subspaces within LLMs that separate truthful from deceptive behavior. We show how aspects of model alignment and post-training are mechanistically represented in these subspaces, and we provide a lightweight, highly accurate method (over 93% accuracy) for detecting whether an LLM's answer is truthful or deceptive from its activations. These findings offer new directions for improving AI safety and enhancing model alignment.
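The abstract does not specify the detector's exact form; a minimal, hypothetical sketch of one common approach consistent with the description (a linear probe over layer activations) is shown below. The synthetic arrays, layer choice, and variable names here are illustrative stand-ins for the paper's dataset and model, not the authors' implementation.

```python
# Hypothetical sketch: train a linear probe to separate truthful from deceptive
# activations. Assumes activations were already extracted at some chosen layer
# for prompts labeled truthful (0) or deceptive (1); synthetic data stands in.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-in activations with shape (n_examples, d_model). In practice these
# would be hidden states collected from the LLM under the controlled,
# deception-inducing system prompts.
d_model = 4096
X_truthful = rng.normal(loc=0.0, scale=1.0, size=(500, d_model))
X_deceptive = rng.normal(loc=0.2, scale=1.0, size=(500, d_model))
X = np.vstack([X_truthful, X_deceptive])
y = np.concatenate([np.zeros(500), np.ones(500)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# The probe's weight vector defines a candidate direction in activation space
# along which truthful and deceptive outputs separate.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```

A probe of this kind is cheap to train and run at inference time, which matches the paper's "lightweight" framing; the reported 93%+ accuracy refers to the authors' method on their dataset, not to this sketch.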
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21293