ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour

ACL ARR 2025 February Submission 3686 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. The dataset is a collection of conversations between a chatbot and a user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by multiple human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) demonstrate the capability to employ manipulative tactics when explicitly instructed to, with annotators identifying manipulation in approximately 57\% of such conversations. Second, even when only instructed to be ``persuasive'' without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly gaslighting and fear enhancement. Third, using text classification techniques to detect manipulation in these conversations, we find that small fine-tuned open-source models, such as BERT+BiLSTM, outperform zero-shot classification with larger models like GPT-4o and Sonnet-3.5. However, the baseline models presented are still not usable in real-world oversight applications, as they would flag too many false positives and negatives, and more work will be required to reach higher performance. Our work provides important insights for AI safety research and highlights the need for careful consideration of manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation, benchmarking, NLP datasets, ethical considerations in NLP applications, policy and governance
Contribution Types: Data resources
Languages Studied: English
Submission Number: 3686