FairMedQA: Benchmarking Bias in Large Language Models for Medical Question Answering

ACL ARR 2026 January Submission 2428 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM Bias, Medical QA, Adversarial Testing, Bias Benchmark
Abstract: Large language models (LLMs) are approaching expert-level performance in medical question answering (QA), demonstrating strong potential to improve public healthcare. However, underlying biases related to sensitive attributes such as sex and race pose life-critical risks. The extent to which such attributes affect diagnosis remains an open question and requires comprehensive empirical investigation. Moreover, even the latest Counterfactual Patient Variations (CPV) benchmark struggles to distinguish the bias levels of different LLMs. To address these gaps, we introduce FairMedQA, a new benchmark comprising 4,806 counterfactual question pairs constructed from 801 clinical vignettes, and evaluate 12 representative LLMs on it. Our results reveal substantial accuracy disparities, ranging from 3 to 19 percentage points, across sensitive demographic groups. Notably, FairMedQA exposes biases at least 12 percentage points larger than those identified by the CPV benchmark, demonstrating superior benchmarking sensitivity. These findings underscore an urgent need for targeted debiasing techniques and more rigorous, identity-aware validation protocols before LLMs can be safely integrated into practical clinical decision-support systems.
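As a rough illustration of the accuracy-disparity metric described in the abstract, the sketch below scores counterfactual question pairs per demographic group and reports the largest gap in percentage points. This is a minimal sketch, not the paper's evaluation harness: the record fields (`group`, `prediction`, `answer`) and the `accuracy_disparity` helper are hypothetical, since FairMedQA's actual schema is not specified here.

```python
from collections import defaultdict

def accuracy_disparity(records):
    """Per-group accuracy and the max gap (in percentage points).

    `records` is a list of dicts with hypothetical fields:
      - "group":      demographic group of the counterfactual variant
      - "prediction": the model's chosen answer
      - "answer":     the gold answer
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["prediction"] == r["answer"])

    # Accuracy per group, as a percentage.
    acc = {g: 100.0 * correct[g] / total[g] for g in total}
    # Disparity: gap between the best- and worst-served groups.
    return acc, max(acc.values()) - min(acc.values())

# Toy example: the model errs on one "male" variant of a pair,
# yielding a 50-point disparity on this tiny sample.
records = [
    {"group": "male",   "prediction": "B", "answer": "B"},
    {"group": "male",   "prediction": "C", "answer": "B"},
    {"group": "female", "prediction": "B", "answer": "B"},
    {"group": "female", "prediction": "B", "answer": "B"},
]
acc, gap = accuracy_disparity(records)
print(acc, f"disparity = {gap:.1f} pp")
```

On a full benchmark run, `records` would hold one entry per counterfactual variant of each vignette, so the reported gap directly reflects how much a model's accuracy shifts when only the sensitive attribute changes.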
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: English
Submission Number: 2428