Keywords: LLM Bias, Medical QA, Adversarial Testing, Bias Benchmark
TL;DR: We introduce FairMedQA and use it to benchmark medical bias in LLMs across models and versions
Abstract: Large language models (LLMs) are reaching expert-level accuracy on medical diagnosis questions, yet their underlying biases pose life-critical risks. Bias linked to race, sex, and socioeconomic status is well documented in clinical settings, but a consistent, automated testbed and a large-scale empirical study across models and versions remain missing. To fill this gap, we present FairMedQA, a benchmark for evaluating bias in medical QA via adversarial testing. FairMedQA contains 4,806 adversarial descriptions and counterfactual question pairs generated by a multi-agent framework from 801 clinical vignettes of the United States Medical Licensing Examination (USMLE) dataset. Using FairMedQA, we benchmark 12 representative LLMs and observe substantial statistical parity differences (SPD) between the counterfactual pairs across models, ranging from 3 to 19 percentage points. Compared with the existing CPV benchmark, FairMedQA reveals 15\% larger average accuracy gaps between privileged and unprivileged groups. Moreover, our cross-version analysis shows that upgrading from GPT-4.1-Mini to GPT-5-Mini significantly improves accuracy and fairness simultaneously. These results demonstrate that LLMs' performance and fairness in medicine and healthcare are not inherently a zero-sum trade-off, and that ``win–win'' outcomes are achievable.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13715