Keywords: Democratic AI, Corpus Clarification, Argument Mining
Abstract: LLMs are ubiquitous in modern NLP, and while their applicability extends to texts produced for democratic activities such as online deliberations or large-scale citizen consultations, ethical questions have been raised about their use as analysis tools. We continue this line of research with two main goals: (a) to develop resources that help standardize citizen contributions in public forums at the $\textbf{pragmatic level}$, making them more amenable to, e.g., topic modeling and political analysis; (b) to study how reliably this standardization can be performed by small, open-weights LLMs, i.e., models that can be run locally and transparently with limited resources. We accordingly introduce $\textbf{Corpus Clarification}$, a preprocessing framework for large-scale consultation data that transforms noisy, multi-topic contributions into structured, self-contained argumentative units ready for downstream analysis. We present $\textbf{GDN-CC}$, a manually curated dataset of 1,231 contributions to the French $\textit{Grand Débat National}$, comprising 2,285 argumentative units that are annotated for argumentative structure and manually clarified. We then show that fine-tuned Small Language Models match or outperform LLMs at reproducing these annotations, and we measure their usability for an opinion clustering task. Finally, we release $\textbf{GDN-CC-large}$, an automatically annotated corpus of 240k contributions, the largest annotated democratic consultation dataset to date.
Paper Type: Long
Research Area: Computational Social Science, Cultural Analytics, and NLP for Social Good
Research Area Keywords: NLP tools for social analysis
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: French
Submission Number: 6383