Generating Explanations of Stereotypical Biases with Large Language Models
Abstract: Existing studies investigate stereotypical biases in large language models (LLMs) through the difference between real-world and counterfactual data. In such settings, real-world data typically exhibit pro-stereotypical bias, while counterfactual data rewritten by humans exhibit anti-stereotypical bias. Because judgments of stereotypical bias are inherently subjective, it is crucial to explain the judgment. In this study, we aim to use LLMs to judge whether a sentence is pro- or anti-stereotypical and to explain the reason for the judgment. To this end, we construct a stereotypical bias explanation dataset. The experimental results show that LLMs outperform humans in distinguishing pro- and anti-stereotypical biases. Moreover, our constructed dataset is highly effective for training smaller language models to generate high-quality explanations. Finally, we find that LLMs diverge more from human annotations on counterfactual data than on real-world data.