$\texttt{RP-Mod}\ \&\ \texttt{RP-Crowd:}$ Moderator- and Crowd-Annotated German News Comment Datasets

Published: 11 Oct 2021, Last Modified: 23 May 2023
Venue: NeurIPS 2021 Datasets and Benchmarks Track (Round 2)
Readers: Everyone
Keywords: Abusive Language Detection, Newspaper, Comment Moderation, Crowd Study, NLP
TL;DR: Introducing the largest annotated German dataset for comment moderation
Abstract: Abuse and hate are pervading social media and the comment sections of many news media companies. To avoid losing readers who are appalled by inappropriate texts, these platform providers invest considerable effort in moderating user-generated contributions. Legislation reinforces this pressure by making the failure to remove such comments a punishable offense. While (semi-)automated solutions using Natural Language Processing and advanced Machine Learning techniques are becoming increasingly sophisticated, the domain of abusive language detection still struggles because large, well-curated non-English datasets are scarce or not publicly available. With this work, we publish and analyse the largest annotated German abusive language comment datasets to date. In contrast to existing datasets, we achieve a high labeling standard by conducting a thorough crowd-based annotation study that complements professional moderators' decisions, which are also included in the dataset. We compare and cross-evaluate the performance of baseline algorithms and state-of-the-art transformer-based language models, which are fine-tuned on our datasets and on an existing alternative, demonstrating the datasets' usefulness for the community.
Supplementary Material: pdf
URL: https://doi.org/10.5281/zenodo.5291339
Contribution Process Agreement: Yes
Dataset Url: https://doi.org/10.5281/zenodo.5242915
License: Creative Commons Attribution Non Commercial Share Alike 4.0 International
Author Statement: Yes