Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs
Keywords: Human Study; Reliable LLM; Public Deliberation; Computational Social Science; Large-Scale Evaluation
Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have shown promise for summarizing large-scale deliberations, they also risk underrepresenting minority perspectives, raising fairness concerns in high-stakes contexts.
Studying and mitigating these issues requires comprehensive, large-scale evaluation, yet current practice often relies on LLM judges, which align weakly with human judgments. We introduce DeliberationBank, a large-scale, human-grounded benchmark for deliberation summarization that contains (1) 3,000 participant-generated opinions across ten deliberation questions and (2) 4,500 human annotations evaluating summaries along four dimensions: representativeness, informativeness, neutrality, and policy approval. Using this benchmark, we train DeliberationJudge, a domain-aligned evaluator that provides more reliable and efficient assessments than general-purpose LLM judges. With this evaluation framework, we benchmark 18 LLM summarizers and uncover consistent weaknesses, including systematic underrepresentation of minority viewpoints. Our benchmark and evaluator offer a scalable and reliable foundation for assessing deliberation summarization systems, supporting the development of more representative, equitable, and policy-relevant AI tools.
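As a rough illustration of the evaluation setup described in the abstract, the sketch below scores a candidate deliberation summary along the four annotated dimensions with a fine-tuned judge model. The checkpoint name, prompt format, and regression-head setup are assumptions for illustration only, not the authors' released DeliberationJudge implementation.

```python
# Minimal sketch, assuming a sequence-classification judge fine-tuned to emit one
# score per dimension. The checkpoint name and prompt layout are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

DIMENSIONS = ["representativeness", "informativeness", "neutrality", "policy_approval"]

def score_summary(judge_name: str, question: str, opinions: list[str], summary: str) -> dict:
    """Score one deliberation summary along the four dimensions with a trained judge."""
    tokenizer = AutoTokenizer.from_pretrained(judge_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        judge_name, num_labels=len(DIMENSIONS)  # one regression head per dimension (assumed)
    )
    model.eval()

    # Concatenate the deliberation question, participant opinions, and the candidate summary.
    text = (
        f"Question: {question}\n"
        f"Opinions: {' | '.join(opinions)}\n"
        f"Summary: {summary}"
    )
    inputs = tokenizer(text, truncation=True, max_length=4096, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)
    return {dim: float(score) for dim, score in zip(DIMENSIONS, logits)}
```

A harness along these lines could be run over each of the 18 benchmarked summarizers' outputs and compared against the 4,500 human annotations to check judge-human agreement per dimension.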
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 1768