Abstract: Hate speech detection classifiers suffer from spurious correlations between specific words and the hate class. The spurious words can be either identity words (e.g., "black", "female", "gay") or non-identity words (e.g., "sport", "football"). Current studies mainly focus on removing spurious correlations based on predefined identity words. In this paper, we develop a novel strategy for mitigating spurious correlations, called ARLHAD, that requires no prior knowledge of spurious words. ARLHAD optimizes a minimax game between a classifier and an adversary: the classifier aims to improve hate speech detection performance by minimizing the classification loss, while the adversary aims to maximize the loss mainly caused by spurious words. After training, ARLHAD improves overall performance and, more importantly, alleviates the spurious correlations. Experimental results on three hate speech detection datasets show the effectiveness of ARLHAD.
Paper Type: Short
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: spurious correlations; hate speech detection
Languages Studied: English
Submission Number: 1829