Learning to Lie: Reinforcement Learning Attacks Damage Human-AI Teams and Teams of LLMs

ICLR 2026 Conference Submission15335 Authors

19 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: adversarial attacks, redteaming, human-AI teams, decision-making, social influence, influence evolution, large language models, LLMs, model based reinforcement learning
TL;DR: We show that ML can model the dynamics of human-AI team decision-making, and that model-based RL can exploit this understanding to strategically mislead both human and LLM teams, revealing vulnerabilities in collaborative decision-making systems.
Abstract: As artificial intelligence (AI) assistants become more widely adopted in safety-critical domains, it becomes important to develop safeguards against potential failures or adversarial attacks. A key prerequisite to developing these safeguards is understanding the ability of these AI assistants to mislead human teammates. We investigate this attack problem within the context of an intellective strategy game where a team of three humans and one AI assistant collaborate to answer a series of trivia questions. Unbeknownst to the humans, the AI assistant is adversarial. Leveraging techniques from Model-Based Reinforcement Learning (MBRL), the AI assistant learns a model of the humans' trust evolution and uses that model to manipulate the group decision-making process to harm the team. We evaluate two models -- one inspired by literature and the other data-driven -- and find that both can effectively harm the human team. Moreover, we find that in this setting while our data-driven model is the most capable of accurately predicting how human agents appraise their teammates given limited information on prior interactions, the model based on principles of cognitive psychology does not lag too far behind. Finally, we compare the performance of state-of-the-art LLM models to human agents on our influence allocation task to evaluate whether the LLMs allocate influence similarly to humans or if they are more robust to our attack. These results enhance our understanding of decision-making dynamics in small human-AI teams and lay the foundation for defense strategies.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15335
Loading