The Ghost Annotator: a Framework to Explore Human Label Variation in Content Moderation through Conformal Prediction

ACL ARR 2026 January Submission 7407 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Bias detection, Conformal prediction, Human-model alignment, Evaluation, Uncertainty estimation
Abstract: Current research predominantly focuses on model performance while overlooking uncertainty, particularly as LLMs increasingly generate annotated data. We introduce a framework combining conformal prediction with collaborative filtering to detect LLM biases. Using Non-Conformity Scores (NCS), we introduce the Ghost Prediction metric and Ghost Annotator concept to quantify and profile cases where models diverge from all human annotations. Applying Cosine similarity measures, we identify systematic biases along sociodemographic axes. Evaluating four LLMs across four content moderation datasets we revealed that smaller LLMs tend to be more confident yet less aligned with human annotations compared to larger models, and across all models, uncertainty increases as annotator disagreement rises, mirroring collective human behavior. Finally the Ghost Annotator framework unveils strong alignment between LLMs and annotators of a specific gender on particular datasets.
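The abstract's conformal-prediction component can be illustrated with a generic split conformal classifier. The sketch below is not the paper's method (the paper's NCS definition and the Ghost Prediction metric are not specified here); it only shows the standard mechanics of turning non-conformity scores into prediction sets with a finite-sample-corrected calibration quantile. All names and the toy data are illustrative assumptions.

```python
import numpy as np

def conformal_prediction_sets(cal_scores, test_scores, alpha=0.1):
    """Generic split conformal prediction for classification (a sketch,
    not the paper's exact procedure).

    cal_scores:  (n_cal,) non-conformity score of the TRUE label for each
                 calibration example (e.g. 1 - p_model(y_true | x)).
    test_scores: (n_test, n_classes) non-conformity score of every
                 candidate label for each test example.
    Returns a boolean (n_test, n_classes) mask: True means the label is
    included in the prediction set at target coverage 1 - alpha.
    """
    n = len(cal_scores)
    # Finite-sample-corrected quantile level, clipped to 1 for small n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_scores, level, method="higher")
    # A label enters the set iff its score does not exceed the threshold.
    return test_scores <= q

# Toy usage: 4 calibration scores, 2 test examples, 2 candidate labels.
cal = np.array([0.1, 0.2, 0.3, 0.4])
test = np.array([[0.05, 0.50],
                 [0.35, 0.90]])
mask = conformal_prediction_sets(cal, test, alpha=0.5)
```

Under this construction, a "ghost prediction" in the paper's sense would plausibly correspond to a test item whose prediction set excludes every label assigned by human annotators; the mask above is the object one would check for that.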
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7407