Persuasion-R1: Reinforcement Learning for Training and Analyzing Persuasive LLM Agents
Keywords: large language models, persuasion, reinforcement learning, safety and reliability, red-teaming, multi-agent systems
TL;DR: Persuasion-R1 is an RL framework for training persuader agents that studies how reward design shapes persuasive strategies in LLMs, revealing broad susceptibility even in unseen models.
Abstract: We introduce Persuasion-R1, a reinforcement learning framework for training LLM-based persuader agents in interactive settings. Framing persuasion as a two-agent task, we train a persuader to shift the answers of a persuadee model on multiple-choice questions and systematically study how reward function design shapes the resulting persuasive strategies. Preliminary results show that LLMs, including persuadee models not seen during training, are highly susceptible to trained persuaders, losing up to 60 percentage points of accuracy in as little as a single interaction turn on out-of-distribution question-answering benchmarks. Our findings suggest that reward design qualitatively alters which persuasive strategies models learn, with direct implications for red-teaming the reliability of multi-agent LLM systems. This work is in progress at the time of submission.
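To make the two-agent setup concrete, below is a minimal sketch of a single-turn persuasion episode and one possible reward signal. It is an illustration under stated assumptions, not the authors' implementation: the functions `persuader_policy` and `persuadee_answer` are hypothetical stand-ins for the trained persuader and the frozen persuadee LLM, and the flip-based reward is only one of the reward designs the abstract says the paper compares.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQItem:
    question: str
    options: list[str]
    correct_idx: int


def persuader_policy(question: str, options: list[str], target_idx: int) -> str:
    """Hypothetical stand-in for the trained persuader LLM: produce a message
    arguing for the target (incorrect) option."""
    return f"Consider that '{options[target_idx]}' best answers: {question}"


def persuadee_answer(question: str, options: list[str], message: str | None = None) -> int:
    """Hypothetical stand-in for the frozen persuadee LLM: return the index of
    its chosen option, optionally after reading the persuader's message."""
    return random.randrange(len(options))  # placeholder for an actual model call


def persuasion_episode(item: MCQItem) -> float:
    """One single-turn episode: query the persuadee, send one persuasive message,
    query again, and score the persuader on whether the answer was shifted."""
    initial = persuadee_answer(item.question, item.options)
    # Pick an incorrect option as the persuasion target.
    target = next(i for i in range(len(item.options)) if i != item.correct_idx)
    message = persuader_policy(item.question, item.options, target)
    final = persuadee_answer(item.question, item.options, message)
    # One possible reward design: +1 only when the persuadee is pushed off a
    # correct answer; alternatives (e.g. rewarding any change of answer) would
    # plausibly induce different persuasive strategies, per the abstract.
    return 1.0 if (initial == item.correct_idx and final != item.correct_idx) else 0.0


if __name__ == "__main__":
    item = MCQItem("Which planet is largest?", ["Earth", "Jupiter", "Mars", "Venus"], 1)
    print(persuasion_episode(item))
```

In an actual RL loop, the scalar returned by `persuasion_episode` would serve as the reward for updating the persuader policy, while the persuadee remains fixed.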
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 120