Abstract: Automatic toxic language detection is important for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity, such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of $6.8K$ social media posts and $40K$ toxicity annotations across diverse identity groups. To capture the role of conversational context in shaping toxicity, as is typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g., OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts.
Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on our dataset, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation. We will release all code, data, and models upon publication.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: data ethics; model bias/fairness evaluation
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 4375