Can LLMs deliberate? Benchmarking Collective Reasoning in Multi-Agent Systems

Published: 09 May 2026, Last Modified: 09 May 2026, PoliSim@CHI 2026, CC BY 4.0
Keywords: LLM agents, multi-agent simulation, deliberation, policy simulation, evaluation framework, discourse quality, Deliberative Reason Index, perspective diversity, responsible AI
TL;DR: We assess the deliberative capacity of LLM agents in a small-group deliberative setting using process, outcome, and diversity metrics.
Abstract: Deliberative mini-publics are used in policy practice to approximate what an informed public would think after structured discussion. We introduce DelibSim as an evaluation framework for policy-relevant small-group deliberation with LLM agents. Using 1,980 five-agent deliberations across 12 policy topics and 11 model configurations, we assess whether current LLM agents reproduce core features that make deliberation policy-informative. On process quality, LLM groups reach discourse quality levels close to human reference data. On outcome quality, normative guidance produces small but significant gains in deliberative reasoning quality relative to no discussion, while basic prompting is less robust. At the same time, LLM groups exhibit substantially lower perspective diversity than human groups and display opposite convergence dynamics. We interpret this as a central failure mode for policy use: high procedural quality can coexist with limited heterogeneity in viewpoints, constraining the epistemic function deliberation is meant to serve. We therefore caution against deliberative applications of off-the-shelf LLMs and propose DelibSim as a framework for benchmarking LLM deliberative capacities.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 6