Two Wrongs, No Right: Opposing Measurement Failures in LLM Annotators for Civic Discourse

Varun Kotte

Published: 03 Jun 2026, Last Modified: 03 Jun 2026 | AI4GOOD Workshop 2026 Regular | Everyone | CC BY 4.0

Keywords: trustworthy AI, LLM annotation, computational social science, social desirability bias, civic discourse, measurement bias, AI auditing

TL;DR: LLM annotators can undercount, overcount, and neutralize socially sensitive labels in civic-discourse datasets, so researchers need class-conditional and prevalence-level validation before using them for societal measurement.

Abstract: Large language models (LLMs) are increasingly used as annotators in computational social science pipelines that characterize civic discourse, harmful content, and public opinion. This use case is safety-critical in a measurement sense: labels produced by an annotator become prevalence estimates and, eventually, substantive claims about society. We audit three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) on six TweetEval tasks under four prompting conditions, giving 72 model-task-prompt cells. The central result is that alignment-sensitive annotation errors do not move in one direction. Zephyr exhibits leniency bias, systematically avoiding harmful labels: on offensive language, its false benign rate is 0.729 and its false alarm rate is 0.031. Mistral and Qwen exhibit overcorrection, aggressively assigning the same harmful labels; Mistral's hate-speech false alarm rate is 0.604. All three models exhibit neutrality bias on abortion stance, underestimating opposition prevalence by 24 to 40 percentage points under neutral prompting and inflating the neutral label. None of the four prompting interventions we test (neutral prompting, safety framing, depersonalized prompting, and chain-of-thought prompting) provides a model-agnostic fix, and safety framing can worsen stance distortion. We convert these findings into diagnostic criteria based on class-conditional error and prevalence shape, plus a lightweight validation protocol for researchers using LLM annotations in socially consequential studies.
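The class-conditional and prevalence-level quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that "false benign rate" means the share of gold-harmful items the annotator labels as anything other than harmful, that "false alarm rate" means the share of gold-benign items labeled harmful, and that prevalence distortion is the predicted minus gold label share in percentage points. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def class_conditional_errors(gold, pred, harmful="harmful"):
    """Error rates conditioned on the gold class (assumed definitions).

    false_benign_rate: fraction of gold-harmful items not labeled harmful.
    false_alarm_rate: fraction of gold-benign items labeled harmful.
    """
    harmful_total = sum(1 for g in gold if g == harmful)
    benign_total = len(gold) - harmful_total
    missed = sum(1 for g, p in zip(gold, pred) if g == harmful and p != harmful)
    alarms = sum(1 for g, p in zip(gold, pred) if g != harmful and p == harmful)
    return {
        "false_benign_rate": missed / harmful_total if harmful_total else float("nan"),
        "false_alarm_rate": alarms / benign_total if benign_total else float("nan"),
    }


def prevalence_shift(gold, pred):
    """Predicted minus gold label share, in percentage points, per label."""
    n = len(gold)
    gold_counts, pred_counts = Counter(gold), Counter(pred)
    labels = set(gold_counts) | set(pred_counts)
    return {lab: 100.0 * (pred_counts[lab] - gold_counts[lab]) / n for lab in labels}


# Toy data (hypothetical): a lenient annotator that misses 3 of 4 harmful items.
gold = ["harmful"] * 4 + ["benign"] * 6
pred = ["benign", "benign", "benign", "harmful"] + ["benign"] * 5 + ["harmful"]

print(class_conditional_errors(gold, pred))
print(prevalence_shift(gold, pred))
```

Reporting both rates separately, rather than a single accuracy or F1 score, is what makes opposite failure modes visible: a lenient annotator shows a high false benign rate with a low false alarm rate, while an overcorrecting one shows the reverse.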

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 497
