Keywords: fairness, large language models, chatbots
TL;DR: A methodology for evaluating bias in open-ended chat
Abstract: Some chatbots have access to a user’s name when responding. Prior work has
shown that large language model outputs can change based on the demographic
traits correlated with a name, such as gender or race. In this study, we introduce
a scalable method for studying one form of first-person
fairness—fairness towards the user based on their demographic information—
across a large and heterogeneous corpus of actual chats. We leverage a language
model as an AI “research assistant” (AI RA) that can privately and scalably analyze
chat data, surfacing broader trends without exposing specific examples to the
researchers. We corroborate the labels of the AI RA with independent human
annotations, finding its judgments highly consistent with human ratings of gender bias (less so
for racial bias). We apply this methodology to a large set of chats with a commercial
chatbot. We assess the overall quality of responses conditional on different names, as well as
subtle differences in similar-quality responses that may in aggregate reinforce
harmful stereotypes based on gender or race. The largest detected biases are gender
biases in older generations of models and in open-ended tasks, like writing a story.
Finally, evaluations like ours are important for monitoring and reducing biases.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8413