Abstract: Social biases and stereotypes have recently raised significant ethical concerns in Natural Language Processing (NLP). NLP models, particularly those used for text classification, often perpetuate these biases by producing different output scores for different demographic groups, leading to discriminatory outcomes. In this paper, we conduct a comprehensive evaluation of potential social biases across a diverse array of text classification tasks, focusing on gender, race, and religion through counterfactual fairness testing. We examine 11 widely-used text classification models from Hugging Face and 3 commercial sentiment analysis models using 5 different datasets. Our findings reveal a pronounced tendency for these systems to favour certain demographic groups over others, with statistically significant biases detected. Specifically, the analysis highlights substantial disparities in how these models score identical content when demographic variables are altered, demonstrating inherent biases in the underlying models.
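As a minimal illustration of counterfactual fairness testing (the model, template sentence, and identity terms below are illustrative choices, not the specific ones used in the study), one can score the same sentence with only the demographic term changed and compare the model's outputs:

```python
# Sketch of counterfactual fairness testing with a Hugging Face classifier.
# The model, template, and identity terms are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

TEMPLATE = "The {group} engineer presented the quarterly report."
GROUPS = ["white", "Black", "Asian", "Hispanic"]  # illustrative identity terms

scores = {}
for group in GROUPS:
    text = TEMPLATE.format(group=group)
    result = classifier(text)[0]
    # Fold label and confidence into one signed score for comparison.
    signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
    scores[group] = signed
    print(f"{group:>10}: {result['label']:>8} ({signed:+.4f})")

# A fair model would score every counterfactual (near-)identically;
# the spread across groups is a simple indicator of disparate treatment.
print(f"max disparity: {max(scores.values()) - min(scores.values()):.4f}")
```

In practice, such per-sentence differences are aggregated over many templates and identity terms before testing for statistical significance.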