Localizing Lying in Llama: Understanding Instructed Dishonesty on True-False Questions Through Prompting, Probing, and Patching

Published: 23 Oct 2023, Last Modified: 28 Nov 2023 · SoLaR Poster
Keywords: Interpretability, Dishonesty, Deception, Truthfulness
TL;DR: We prompt Llama-2-70b-chat to lie and use mechanistic interpretability techniques to understand the mechanisms involved.
Abstract: Large language models (LLMs) demonstrate significant knowledge through their outputs, though it is often unclear whether undesirable outputs are due to a lack of knowledge or dishonesty. In this paper, we conduct an extensive study of intentional dishonesty in Llama-2-70b-chat by engineering prompts that instruct it to lie and then using mechanistic interpretability approaches to localize where in the network this lying behavior occurs. Using three independent methodologies (probing, patching, and concept erasure), we consistently find five layers in the model that are highly important for lying. We then successfully perform causal interventions on only 46 attention heads (less than 1\% of all heads in the network), causing the lying model to act honestly. These interventions work robustly across four prompts and six dataset splits. We hope our work can help understand and thus prevent lying behavior in LLMs.
Submission Number: 124
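To make the probing methodology mentioned in the abstract concrete, below is a minimal illustrative sketch (not the authors' code) of layer-wise linear probing: train one classifier per layer on residual-stream activations from honest vs. instructed-to-lie prompts, and treat layers where probe accuracy is high as candidate locations of the lying behavior. The array shapes, random placeholder activations, and variable names are hypothetical; in practice the activations would be cached from forward passes of Llama-2-70b-chat on true/false statements under honest and lying prompts.

```python
# Illustrative sketch of layer-wise linear probing for "lying" representations.
# Placeholder data only; real activations would come from Llama-2-70b-chat.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical sizes (kept small here; Llama-2-70b has 80 layers, d_model = 8192).
n_examples, n_layers, d_model = 1000, 16, 256
rng = np.random.default_rng(0)

# Placeholder residual-stream activations, shape (examples, layers, d_model),
# and labels: 1 = activation came from a prompt instructing the model to lie.
acts = rng.standard_normal((n_examples, n_layers, d_model)).astype(np.float32)
labels = rng.integers(0, 2, size=n_examples)

# Fit one probe per layer and record held-out accuracy; layers where accuracy
# rises sharply are candidates for where lying behavior is represented.
layer_accuracy = []
for layer in range(n_layers):
    X_train, X_test, y_train, y_test = train_test_split(
        acts[:, layer, :], labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    layer_accuracy.append(probe.score(X_test, y_test))

best = int(np.argmax(layer_accuracy))
print(f"Most predictive layer: {best} (accuracy {layer_accuracy[best]:.2f})")
```

On the random placeholder data above the probes hover near chance; the point is only the per-layer sweep structure, which the patching and concept-erasure analyses in the paper complement with causal evidence.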