Keywords: persuasion, vigilance, LLMs, cognition, reasoning, games
TL;DR: Quantifying Persuasion and Vigilance in Large Language Models
Abstract: With the increasing integration of Large Language Models (LLMs) into areas of high-stakes human decision-making, e.g., medicine and finance, it is important to understand LLMs' social capacities, such as persuasion and vigilance. Yet there is a dearth of paradigms that allow researchers to examine models' social capacities in a manner that is simultaneously tractable (i.e., permits quantification and rational analysis), scalable (i.e., can be used to examine models of arbitrary intelligence), and rich (i.e., naturally captures multi-turn interactions). This gap has limited our understanding of LLM social capacities to high-level observations rather than detailed capability evaluations. We propose using Sokoban, a multi-turn puzzle-solving game composed of actionable, fixed states that can be made arbitrarily complex and evaluated precisely, to examine how LLMs compose persuasive arguments that both assist and mislead players, and how vigilant LLMs are in ignoring malicious advice when acting as players. Surprisingly, we find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs: performing well on the game does not automatically mean a model can detect when it is being misled, even when the possibility of deception is explicitly mentioned. However, LLMs do consistently modulate their token use, reasoning with fewer tokens when advice is benevolent and more when it is malicious, even when they are ultimately persuaded to take actions that lead to failure. To our knowledge, our work presents the first investigation of the relationship between persuasion, vigilance, and task performance, and suggests that monitoring all three independently will be critical for future work in AI safety.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22869