Keywords: Role-playing LLM agent, social simulation benchmark, multi-agent LLM system, opinion dynamics
TL;DR: We introduce DEBATE, a large-scale benchmark for evaluating how well role-playing LLM agents simulate realistic human opinion dynamics in multi-agent conversations.
Abstract: Accurately modeling opinion change through social interactions is crucial for understanding and mitigating polarization, misinformation, and societal conflict. Recent work explores simulating opinion dynamics with role-playing LLM agents (RPLAs)—language models assigned human-like personas that engage in multi-turn, multi-agent opinion exchange. However, existing RPLA simulations often produce unnatural group behaviors (e.g., premature consensus), and empirical benchmarks for evaluating their alignment with real human interactions are lacking. We introduce DEBATE, the first large-scale benchmark for evaluating the authenticity of opinion dynamics in multi-agent RPLA simulations. DEBATE contains 37,357 messages from 2,792 U.S.-based participants who engaged in multi-player, multi-round conversations across 107 controversial topics, reporting both public messages and private beliefs. We simulate these conversations using various LLMs and introduce multi-level evaluation metrics (at the utterance, individual, and group levels) to assess behavioral alignment between humans and RPLAs. Our analyses reveal key behavioral gaps: RPLA groups exhibit stronger opinion convergence and belief drift than humans, and individual agents show more systematic shifts in response to social influence. Ablation studies further highlight the importance of private self-reported opinions in shaping realistic agent behavior. Additionally, while supervised fine-tuning improves surface-level metrics (e.g., ROUGE-L, message length), it falls short on deeper alignment (e.g., semantic and stance alignment). DEBATE enables benchmarking of simulated opinion dynamics and supports future research on aligning multi-agent RPLA simulations with realistic human interactions. The dataset and codebase will be publicly released.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 22060