The GDN-CC Dataset: Automatic Corpus Clarification for AI-enhanced Democratic Citizen Consultations

ACL ARR 2026 January Submission 6383 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Democratic AI, Corpus Clarification, Argument Mining

Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised about their use as analysis tools. We continue this line of research with two main goals: (a) to develop resources that can help standardize citizen contributions in public forums at the $\textbf{pragmatic level}$, making them more amenable to, e.g., topic modeling and political analysis; (b) to study how reliably this standardization can be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. We accordingly introduce $\textbf{Corpus Clarification}$ as a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present $\textbf{GDN-CC}$, a manually curated dataset of 1,231 contributions to the French $\textit{Grand Débat National}$, comprising 2,285 argumentative units, annotated for argumentative structure and manually clarified. We then show that finetuned Small Language Models match or outperform LLMs at reproducing these annotations, and measure their usability for an opinion clustering task. We finally release $\textbf{GDN-CC-large}$, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.

Paper Type: Long

Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good

Research Area Keywords: NLP tools for social analysis

Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis

Languages Studied: French

Submission Number: 6383
