CReHate: Cross-cultural Re-annotation of English Hate Speech Dataset

Published: 01 Jan 2023, Last Modified: 18 Dec 2023, CoRR 2023, Readers: Everyone
Abstract: Most NLP datasets neglect the cultural diversity among language speakers, resulting in a critical shortcoming in hate speech detection and other culturally sensitive tasks. To address this, we introduce CREHate, a CRoss-cultural English Hate speech dataset. To construct CREHate, we follow a two-step procedure: 1) culture-specific post collection and 2) cross-cultural annotation. We sample posts from the SBIC dataset, which predominantly represents North America, and collect posts from four geographically diverse English-speaking countries using culture-specific hate speech keywords that we retrieve from our survey. Annotations are then collected from those four English-speaking countries plus the US to establish representative labels for each country. Our analysis highlights statistically significant disparities in cross-cultural hate speech annotations. Only 56.2% of the posts in CREHate achieve consensus among all five countries, with a peak pairwise disagreement rate of 26%. The annotations show that label disagreements tend to come from the inherent cultural context, subjectivity, and ambiguity of the posts. Lastly, we develop cross-cultural hate speech classifiers that are more accurate at predicting each country's labels than the monocultural classifiers. This confirms the utility of CREHate for constructing culturally sensitive hate speech classifiers.