"I'm not Racist but…": Discovering Bias in the Internal Knowledge of Large Language Models

"I'm not Racist but…": Discovering Bias in the Internal Knowledge of Large Language Models

ICLR 2024 Workshop ME-FoMo Submission73 Authors

Published: 04 Mar 2024, Last Modified: 01 May 2024ME-FoMo 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large Language Models, Natural Language Processing, Societal Bias, Stereotypes, Knowledge Representation, Bias Identification, Transparency, Fairness, Analysis Methodology

TL;DR: Large language models (LLMs) have impressive performance but reflect societal biases. Our novel method generates prompts to uncover these biases, enabling identification and mitigation of bias and promoting fairness and transparency in NLP systems.

Abstract: Large language models (LLMs) have garnered significant attention for their remarkable performance in a continuously expanding set of natural language processing tasks. However, these models have been shown to harbor inherent societal biases, or stereotypes, which can adversely affect their performance in their many downstream applications. In this paper, we introduce a novel, purely prompt-based approach to uncover hidden stereotypes within any arbitrary LLM. Our approach dynamically generates a knowledge representation of internal stereotypes, enabling the identification of biases encoded within the LLM's internal knowledge. We demonstrate how our approach can be leveraged to design targeted bias benchmarks, enabling rapid identification and mitigation of potential bias in downstream tasks. By illuminating the biases present in LLMs and offering a systematic methodology for their analysis, our work contributes to advancing transparency and promoting fairness in natural language processing systems.

Submission Number: 73

Loading