Abstract: Human perception of language depends on personal background, such as gender and ethnicity. While existing studies have shown that large language models (LLMs) hold values closer to those of certain societal groups, it is unclear whether their prediction behavior on subjective NLP tasks also exhibits a similar bias. In this study, leveraging the POPQUORN dataset, which contains annotations from diverse demographic backgrounds, we conduct a series of experiments on six popular LLMs to investigate their ability to understand demographic differences and their potential biases in predicting politeness and offensiveness. We find that for both tasks, model predictions are closer to the labels from White participants than to those from Asian and Black participants. While we observe no significant differences between the two gender groups for most models on offensiveness, LLMs' politeness predictions are significantly closer to women's ratings. We further explore prompting with specific identity information and show that including a target demographic label in the prompt does not consistently improve models' performance. Our results suggest that LLMs hold gender and racial biases on subjective NLP tasks, and that demographic-infused prompts alone may not be sufficient to mitigate such biases.
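To make the evaluation idea concrete, below is a minimal, hypothetical sketch of the setup the abstract describes: prompting a model for a 1-5 rating (optionally infused with a demographic identity) and measuring how far its ratings fall from each demographic group's mean ratings. The function names, prompt wording, toy data, and distance measure are illustrative assumptions, not the paper's actual code or the POPQUORN schema.

```python
from statistics import mean

def build_prompt(text, attribute="offensive", identity=None):
    """Compose a 1-5 rating prompt, optionally prefixed with a demographic identity.
    (Hypothetical wording; the paper's prompts may differ.)"""
    persona = f"Respond as a {identity} person. " if identity else ""
    return (
        f"{persona}On a scale from 1 (not at all) to 5 (very), "
        f"how {attribute} is the following text?\n\n{text}\n\nRating:"
    )

def gap_to_group(model_ratings, group_ratings):
    """Mean absolute difference between the model's rating and a group's mean
    rating, computed item by item (one possible notion of 'closeness')."""
    return mean(abs(m - mean(g)) for m, g in zip(model_ratings, group_ratings))

if __name__ == "__main__":
    # Toy data: per-item model ratings and per-item annotator ratings by group.
    model = [3, 4, 2]
    groups = {
        "White": [[3, 4], [4, 5], [2, 2]],
        "Black": [[1, 2], [3, 3], [4, 4]],
        "Asian": [[2, 3], [5, 5], [3, 4]],
    }
    print(build_prompt("You people never listen.", identity="Black woman"))
    for name, ratings in groups.items():
        print(name, round(gap_to_group(model, ratings), 2))
```

A smaller gap for one group than another would indicate that the model's predictions align more closely with that group's ratings; the demographic-infused variant simply swaps in a non-empty identity when building the prompt.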
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 3437