Can LLMs deliberate? Benchmarking Collective Reasoning for Democratic AI Applications

Maurice Flechtner

Published: 03 Jun 2026, Last Modified: 03 Jun 2026. Venue: AI4GOOD Workshop 2026 (Regular). License: CC BY 4.0

Keywords: Deliberation, AI Reasoning, AI for Civic Discourse, AI Safety, LLM, Benchmark, Multi-Agent

TL;DR: We benchmark deliberation among LLM agents on real-world political science indices and find that LLMs show fundamentally different deliberative reasoning patterns from humans, which warns against their naive application in deliberative roles.

Abstract: Multi-agent LLM systems are increasingly proposed for democratic applications, including consensus-finding, deliberation moderation, and the representation of stakeholder perspectives. While existing benchmarks for collective reasoning in LLM systems show strong performance on verifiable tasks such as math problems, we argue that these benchmarks cannot assess deliberative reasoning on contested normative questions, which is exactly what democratic applications demand. We introduce DelibSim, a configurable simulation environment that benchmarks multi-agent deliberation along two theoretically grounded dimensions: procedural discourse quality (AQuA) and deliberative reasoning quality (DRI). Across 1,980 five-agent deliberations spanning 11 model configurations and 12 citizen-assembly topics, LLMs achieve discourse quality statistically indistinguishable from human deliberation (AQuA 2.94 vs. 2.98). Normative prompting yields a small but reliable improvement in shared understanding (∆DRI = 0.029, p = 0.005), but effects do not survive topic-level correction and turn negative on ethically complex topics. Most strikingly, LLM groups exhibit far lower perspective diversity than human groups (6.5 vs. 18.8) and reversed convergence dynamics: human deliberation decreases dispersion as diverse views synthesize, whereas LLM deliberation increases it. These findings expose an important failure mode for AI deployed in deliberative settings: high-quality deliberative discourse can mask fundamentally different reasoning dynamics. We release DelibSim as an open benchmark to support the responsible deployment of multi-agent AI in democratic applications.
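
The convergence claim compares how spread out a group's positions are before and after deliberation. The abstract does not say how dispersion is measured, so the following is only a minimal sketch under a stated assumption: dispersion is taken to be the mean pairwise Euclidean distance between agents' position vectors. The function names and the toy five-agent data are hypothetical illustrations and are not drawn from DelibSim or from the paper's results.

```python
from itertools import combinations
from math import dist
from statistics import mean


def dispersion(positions):
    """Mean pairwise Euclidean distance between agents' position vectors.

    ASSUMPTION: this is one plausible dispersion measure, chosen for
    illustration; the abstract does not specify the metric actually used.
    """
    return mean(dist(a, b) for a, b in combinations(positions, 2))


def dispersion_change(pre, post):
    """Post-deliberation dispersion minus pre-deliberation dispersion.

    Negative values mean the group converged; positive values mean it
    spread out.
    """
    return dispersion(post) - dispersion(pre)


# Toy data (hypothetical): five agents, two-dimensional positions.
pre = [(1.0, 5.0), (4.0, 2.0), (5.0, 5.0), (2.0, 1.0), (3.0, 3.0)]

# Each agent moves halfway toward the point (3, 3): a converging group.
converged = [(2.0, 4.0), (3.5, 2.5), (4.0, 4.0), (2.5, 2.0), (3.0, 3.0)]

# Each agent moves away from (3, 3), doubling its offset: a diverging group.
diverged = [(-1.0, 7.0), (5.0, 1.0), (7.0, 7.0), (1.0, -1.0), (3.0, 3.0)]

if __name__ == "__main__":
    print("converging group:", dispersion_change(pre, converged))
    print("diverging group: ", dispersion_change(pre, diverged))
```

Under this reading, the pattern the abstract reports for human groups corresponds to a negative change and the pattern reported for LLM groups to a positive one.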

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 75
