Tall Tales at Different Scales: Evaluating Scaling Trends For Deception in Language Models

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: societal considerations including fairness, safety, privacy
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Deception, Model Evaluations, Scaling
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: This paper evaluates scaling trends for deception in language models.
Abstract: Language is a natural medium for deception, and there is growing evidence that language models (LMs) have the capability to deceive humans and other AI systems. We build on the existing literature on deceptive AI agents and on the beliefs of LMs to study deception in LMs from a behavioural perspective. The philosophical notion of deception involves one agent causing another agent to have a false belief, but the ascription of \emph{agency} and \emph{beliefs} to LMs is a contentious topic. Following past work in philosophy and AI, we argue that one important characteristic of agents is that they have \emph{consistent beliefs}. We demonstrate scaling trends for LM consistency, showing that LMs become more consistent with model size, instruct fine-tuning, and increased inference compute. Next, we demonstrate that deception can be learned due to errors in the feedback given during training, even with a seemingly benign training objective. We fine-tune LMs to be evaluated as truthful by a systematically biased evaluator and show that they learn to deceive this evaluator. We infer LM beliefs from their behaviour to demonstrate that they do not believe the lies that they tell. Additionally, we find scaling trends for deceptive behaviour: larger LMs learn to target lies at cases where the evaluator makes mistakes, and do so from fewer evaluator errors in the training set. Furthermore, in larger models, lying generalizes to different contexts, and they learn to reaffirm their lies even though they were not trained to do so. Finally, we demonstrate that GPT-4 has learned to lie about its capabilities in order to be evaluated as helpful and harmless.
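Note on the consistency evaluation: the abstract's behavioural notion of belief consistency suggests a simple probe. The sketch below is illustrative only and is not the paper's actual protocol; `query_model`, the prompts, and the canned responses are all hypothetical stand-ins for a real LM API and evaluation set. It scores how often a model gives logically compatible answers to a claim and its negation across paraphrases.

```python
"""Minimal sketch of a belief-consistency probe (assumed setup, not the paper's code)."""


def query_model(prompt: str) -> str:
    # Hypothetical placeholder for an LM call; returns a canned True/False answer
    # so that the sketch runs end to end without any external API.
    return "False" if "not the case" in prompt else "True"


def consistency_score(paraphrases: list[str]) -> float:
    """Fraction of paraphrases for which the claim and its negation get opposite answers."""
    consistent = 0
    for claim in paraphrases:
        ans_pos = query_model(f'Is the following statement true or false? "{claim}"')
        ans_neg = query_model(
            f'Is the following statement true or false? "It is not the case that {claim}"'
        )
        # Consistent beliefs: the claim and its negation should receive opposite answers.
        if ans_pos != ans_neg:
            consistent += 1
    return consistent / len(paraphrases)


if __name__ == "__main__":
    paraphrases = [
        "Paris is the capital of France.",
        "The capital city of France is Paris.",
    ]
    print(f"Consistency: {consistency_score(paraphrases):.2f}")
```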
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: pdf
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5176