AI-LieDar : Examine the Trade-off Between Utility and Truthfulness in LLM Agents

ACL ARR 2024 June Submission3634 Authors

16 Jun 2024 (modified: 02 Jul 2024)ACL ARR 2024 June SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Truthfulness is a key component of the safety of large language models (LLM), particularly when they are deployed as helpful agents in our daily lives. However, the inherent conflict between utility and truthfulness in many LLM instructions raises the question of how LLMs balance these two dimensions. We propose AI-LieDar, a framework designed to study how LLM-based agents navigate these scenarios in an multi-turn interactive setting. Based on the framework, we design a set of scenarios and conduct multi-turn simulations. Additionally, we develop a truthfulness detector, inspired by psychological literature, to assess the agents' responses. Our experiments demonstrate that most models can effectively navigate the scenarios. The truthfulness and goal achievement rate vary, with no clear correlation to model size or capability. However, all models are truthful less than 50\% of the time. We further test the steerability of LLMs towards truthfulness, finding that models can be directed to be deceptive, and even truth-steered models still lie.These findings reveal the complex nature of truthfulness in LLMs and underscore the importance of further research in this area to ensure the safe and reliable deployment of LLMs and AI agents.
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: Dialogue and Interactive Systems,Generation,Human-Centered NLP,Language Modeling,NLP Applications
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis, Position papers
Languages Studied: English
Submission Number: 3634
Loading