Revealing the Bias in Large Language Models via Reward Structured Questions

05 Oct 2022 (modified: 05 May 2023) | FMDM@NeurIPS2022 | Readers: Everyone
Abstract: The success of large language models has been amply demonstrated in recent years. Fine-tuning these models for a specific task yields highly performant models. However, these models also learn biased representations from the data they were trained on. In particular, several recent studies showed that language models can learn to be biased towards certain genders. More recently, several studies tried to eliminate this bias by incorporating human feedback into fine-tuning. In our study, we show that when the question asked to the language model is changed, the log probabilities used to measure bias in the responses change dramatically. Furthermore, in several cases the language model ends up providing a completely opposite response. Recent language models fine-tuned on prior gender bias datasets do not resolve the actual problem, but rather alleviate it only for the dataset on which the model is fine-tuned. We believe our results might lay the foundation for further work on alignment and safety problems in large language models.