Keywords: LLM Bias, Medical QA, Adversarial Testing, Bias Benchmark
TL;DR: We introduce FairMedQA and use it to benchmark medical bias in LLMs across models and versions
Abstract: Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their underlying biases pose life-critical risks. Bias linked to race, sex, and socioeconomic status is well documented in clinical settings, but a consistent, automated testbed and a large-scale empirical study across models and versions remain missing. To fill this gap, we present FairMedQA, a benchmark for evaluating bias in medical QA via adversarial testing. FairMedQA contains 4,806 adversarial descriptions and counterfactual question pairs generated by a multi-agent framework from 801 clinical vignettes of the United States Medical Licensing Examination (USMLE) dataset. Using FairMedQA, we benchmark 12 representative LLMs and observe substantial statistical parity differences (SPD) between the counterfactual pairs across models, ranging from 3 to 19 percentage points. Compared with the existing CPV benchmark, FairMedQA reveals 15\% larger average accuracy gaps between privileged and unprivileged groups. Moreover, our cross-version analysis shows that upgrading from GPT-4.1-Mini to GPT-5-Mini significantly improves accuracy and fairness simultaneously. These results demonstrate that LLMs' performance and fairness in medicine and healthcare are not inherently a zero-sum trade-off, and that ``win–win'' outcomes are achievable.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13715