WorldBench: Quantifying Geographic Disparities in LLM Factual Recall

Published: 05 Mar 2024, Last Modified: 08 May 2024 · ICLR 2024 R2-FM Workshop Poster · CC BY 4.0
Keywords: fairness, LLM, factual recall, bias, benchmark
TL;DR: Using WorldBank data to uncover pervasive LLM performance disparities across regions and income groups
Abstract: As large language models (LLMs) continue to improve and gain popularity, many people use them to recall facts, despite well-documented limitations in LLM factuality. Towards ensuring that models work reliably \emph{for all}, we seek to uncover whether geographic disparities emerge when asking an LLM the same question about different countries. To this end, we present \textsc{WorldBench}, a dynamic and flexible benchmark composed of per-country data from the World Bank. In extensive experiments on state-of-the-art open- and closed-source models, including GPT-4, Gemini, Llama-2, and Vicuna, we find significant biases based on region and income level. For example, error rates are $1.5$ times higher for countries in Sub-Saharan Africa than for countries in North America. We observe these disparities to be consistent across $20$ LLMs and $11$ individual World Bank indicators (i.e., specific statistics, such as population or CO$_2$ emissions). We hope our benchmark will draw attention to geographic disparities in existing LLMs and facilitate the remedying of these biases.
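To make the evaluation protocol concrete, below is a minimal sketch of the kind of pipeline the abstract describes: fetch a ground-truth indicator value from the public World Bank API, ask an LLM the same question about a given country, and compute the relative error, which can then be averaged per region or income group. This is an illustrative assumption, not the authors' released code; the `ask_llm` stub and the prompt wording are hypothetical placeholders, while `SP.POP.TOTL` is the real World Bank code for total population.

```python
import json
import urllib.request

def ask_llm(question: str) -> float:
    """Hypothetical stub: plug in any LLM client (e.g. GPT-4, Llama-2)
    and parse its answer into a single number."""
    raise NotImplementedError("wire up an LLM API of your choice")

def worldbank_value(country_code: str, indicator: str, year: int) -> float:
    """Fetch one indicator value from the public World Bank API (v2)."""
    url = (f"https://api.worldbank.org/v2/country/{country_code}"
           f"/indicator/{indicator}?date={year}&format=json")
    with urllib.request.urlopen(url) as resp:
        _, rows = json.load(resp)  # first element is paging metadata
    return float(rows[0]["value"])

def relative_error(country_code: str, country_name: str,
                   indicator: str = "SP.POP.TOTL", year: int = 2021) -> float:
    """Absolute relative error of the model's answer vs. World Bank data."""
    truth = worldbank_value(country_code, indicator, year)
    guess = ask_llm(f"What was the total population of {country_name} "
                    f"in {year}? Answer with a single number.")
    return abs(guess - truth) / abs(truth)

# Disparity analysis then reduces to averaging these errors over countries
# grouped by World Bank region (e.g. Sub-Saharan Africa vs. North America)
# or income level.
```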
Submission Number: 81