Keywords: llm, healthcare, bias
TL;DR: Cross-Care evaluates biases in language models, highlighting significant discrepancies among model logits, pretraining data, and real-world disease prevalence, and emphasizing the need to integrate real-world data to ensure fairness.
Abstract: Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data.
In this study, we introduce \textbf{Cross-Care}, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups.
We systematically evaluate how demographic biases embedded in pre-training corpora like \textit{The Pile} influence the outputs of LLMs.
We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups.
Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs.
Furthermore, we observe that various alignment methods do little to resolve inconsistencies in the models' representation of disease prevalence across different languages.
For further exploration and analysis, we make all data and a data visualization tool available at: \url{www.crosscare.net}.
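The comparison described above can be illustrated with a minimal sketch (not the authors' code): count how often a disease term co-occurs with each demographic term in a corpus, rank the groups by those counts, and compare that ranking with a ranking derived from real-world prevalence. All document text, group names, and prevalence figures below are toy values introduced purely for illustration.

```python
# Hedged sketch of the Cross-Care-style comparison: corpus co-occurrence
# ranks vs. real-world prevalence ranks. Toy data throughout.

def cooccurrence_counts(documents, disease, groups):
    """Count documents in which the disease term and each group term co-occur."""
    counts = {g: 0 for g in groups}
    for doc in documents:
        text = doc.lower()
        if disease in text:
            for g in groups:
                if g in text:
                    counts[g] += 1
    return counts

def rank(values):
    """Map each key to its rank (0 = largest value)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {k: i for i, k in enumerate(ordered)}

# Toy corpus and hypothetical prevalence figures (illustrative only).
docs = [
    "diabetes rates in the black population",
    "a study of diabetes among white adults",
    "diabetes and hypertension in white patients",
    "hispanic community health survey",
]
groups = ["black", "white", "hispanic"]
counts = cooccurrence_counts(docs, "diabetes", groups)
real_prevalence = {"black": 12.1, "hispanic": 11.8, "white": 7.4}

corpus_rank = rank(counts)
real_rank = rank(real_prevalence)
# Groups whose corpus rank differs from their real-prevalence rank flag a
# potential misrepresentation that a downstream LLM could inherit.
mismatches = [g for g in groups if corpus_rank[g] != real_rank[g]]
print(counts, mismatches)
```

In this toy example the corpus over-represents one group relative to its real prevalence rank, which is the kind of misalignment the benchmark quantifies at scale; a rank-correlation statistic (e.g. Spearman's rho) would summarize the agreement in a single number.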
Submission Number: 864