Keywords: safety, honesty, deception, lie, interpretability, Large Language Model
TL;DR: We use interpretability\transparency tools to understand and control deception in a wide range of large conversational models.
Abstract: Conversational large language models (LLMs) are trained to be helpful, honest and harmless (HHH) and yet they remain susceptible to hallucinations, misinformation and are capable of deception. A promising avenue for safeguarding against these behaviors is to gain a deeper understanding of their inner workings. Here we ask: what could interpretability tell us about deception and can it help to control it? First, we introduce a simple and yet general protocol to induce 20 large conversational models from different model families (Llama, Gemma, Yi and Qwen) of various sizes (from 1.5B to 70B) to knowingly lie. Second, we characterize three iterative refinement stages of deception from the latent space representation. Third, we demonstrate that these stages are \textit{universal} across models from different families and sizes. We find that the third stage progression reliably predicts whether a certain model is capable of deception. Furthermore, our patching results reveal that a surprisingly sparse set of layers and attention heads are causally responsible for lying. Importantly, consistent across all models tested, this sparse set of layers and attention heads are part of the third iterative refinement process. When contrastive activation steering is applied to control model output, only steering these layers from the third stage could effectively reduce lying. Overall, these findings identify a universal motif across deceptive models and provide actionable insights for developing general and robust safeguards against deceptive AI. The code, dataset, visualizations, and an interactive demo notebook are available at \url{https://github.com/safellm-2024/llm_deception}.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13891
Loading