Abstract: With the increasing number of adolescents and children active online, it is important to evaluate the algorithms designed to protect them from physical and mental harm. This work measures the bias that youth language introduces into hate speech detection models. It constructs a novel framework for identifying language bias within trained networks, introduces a technique for detecting emerging hate phrases, and evaluates the unintended bias attached to them. The research focuses specifically on slurs used in hateful speech; accordingly, three bias test sets are constructed: one for emerging hate speech terms, one for established hate terms, and one to test for overfitting. On these test sets, three scientific hate speech detection models and one commercial model are evaluated and compared using a novel Youth Language Bias Score. Finally, fine-tuning is applied as a mitigation strategy for youth language bias, and the resulting classifier is trained and evaluated. The work contributes a novel framework for bias detection, shows that the language used by adolescents influences classifier performance in hate speech classification, and provides the first hate speech classifier specifically trained for online youth language.
Paper Type: long
Research Area: Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English