Abstract: Social media platforms greatly facilitate user communication, but they also open the door to unwanted content such as hate speech, misinformation, and pornography. To protect users from hateful content at massive scale, existing work investigates machine learning solutions for training automated hate speech moderators. Nevertheless, we identify a gap: few existing hate speech datasets are associated with a list of moderation rules. Without clarifying the moderation criteria, a trained moderator may behave differently from users' expectations. This work seeks to bridge this gap by creating a hate speech dataset that matches a list of moderation rules. Using crowdsourcing, we search for and collect a dataset named HateModerate, grounded in Facebook's community standards guidelines for hate speech. We evaluate the performance of state-of-the-art hate speech detectors against HateModerate, revealing substantial discrepancies between these models and content policies. By fine-tuning one model with HateModerate, we observe that fine-tuning can effectively improve the model's conformity to policies. Our results highlight the necessity of developing rule-based datasets for hate speech detection. Our datasets and code are available at: https://sites.google.com/view/content-moderation-project
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.