An information-theoretic study of lying in LLMs

Published: 18 Jun 2024, Last Modified: 26 Jul 2024
Venue: ICML 2024 Workshop on LLMs and Cognition (Poster)
License: CC BY 4.0
Keywords: Information theory, lying in LLMs, internal activations, logit lens, tuned lens
TL;DR: Exploratory analysis of entropy and KL divergence computed over the probability distributions induced by applying the logit lens to LLMs instructed to lie and to tell the truth.
Abstract: This study investigates differences in information processing between lying and truth-telling in Large Language Models (LLMs). Taking inspiration from human cognition research showing that lying demands more cognitive resources than truth-telling, we apply information-theoretic measures to unembedded internal model activations to explore analogous phenomena in LLMs. Our analysis reveals that LLMs converge more quickly to the output distribution when telling the truth and exhibit higher entropy when constructing lies. These findings indicate that lying in LLMs may produce characteristic information-processing patterns, which could help us understand and detect deceptive behavior in LLMs.
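The abstract describes computing entropy and KL divergence over logit-lens distributions at each layer. The following is a minimal illustrative sketch of that kind of measurement, not the authors' code: it assumes GPT-2 as a stand-in model, a hypothetical prompt, and one particular KL direction, KL(layer || final output), none of which are specified in the abstract.

```python
# Illustrative sketch: per-layer entropy and KL divergence to the final output
# distribution via the logit lens. Model, prompt, and KL direction are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model; the paper's models are not listed here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Answer truthfully: The capital of France is"  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors, each [batch, seq, hidden]
hidden_states = out.hidden_states
ln_f = model.transformer.ln_f  # final layer norm (GPT-2-specific attribute)
lm_head = model.lm_head        # unembedding matrix

# Final output distribution at the last token position
log_p_final = F.log_softmax(out.logits[0, -1], dim=-1)

for layer_idx, h in enumerate(hidden_states):
    # Logit lens: unembed the intermediate activation at the last position
    logits = lm_head(ln_f(h[0, -1]))
    log_p = F.log_softmax(logits, dim=-1)
    p = log_p.exp()

    entropy = -(p * log_p).sum().item()
    kl_to_final = (p * (log_p - log_p_final)).sum().item()
    print(f"layer {layer_idx:2d}  entropy={entropy:6.3f}  "
          f"KL(layer||final)={kl_to_final:6.3f}")
```

Comparing these per-layer curves between truth-telling and lying prompts would be one way to probe the faster convergence and higher entropy effects the abstract reports.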
Submission Number: 60