Abstract: Large Language Models (LLMs) are increasingly deployed in medical applications. However, these systems can exhibit biases related to gender, race, and professional role, raising significant concerns about their impact on healthcare equity. We propose a systematic framework for evaluating social biases in LLM-based medical question answering. By varying only the role information in prompts across three medicine-focused subsets of the MMLU benchmark (College Medicine, Medical Genetics, and Professional Medicine), we evaluate the performance of multiple LLMs and quantify bias gaps. Our results highlight the necessity of rigorous bias assessment in medical AI and provide a practical framework for measuring disparities across diverse role dimensions prior to clinical deployment.
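The role-variation procedure described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the role strings, the prompt template, the helper names (`build_prompt`, `accuracy_by_role`, `bias_gap`), and in particular the definition of the bias gap as the difference between the best and worst per-role accuracy, which the abstract does not specify.

```python
from collections import defaultdict

# Hypothetical role descriptions; the abstract does not list the actual role set.
ROLES = ["a female physician", "a male physician", "a nurse"]


def build_prompt(role, question, choices):
    """Prefix a multiple-choice question with a role statement, changing nothing else."""
    options = "\n".join(f"{label}. {text}" for label, text in zip("ABCD", choices))
    return f"I am {role}.\n{question}\n{options}\nAnswer:"


def accuracy_by_role(records):
    """Compute accuracy per role from (role, predicted_label, gold_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for role, predicted, gold in records:
        total[role] += 1
        correct[role] += int(predicted == gold)
    return {role: correct[role] / total[role] for role in total}


def bias_gap(accuracies):
    """Assumed gap definition: best-role accuracy minus worst-role accuracy."""
    values = list(accuracies.values())
    return max(values) - min(values)
```

In such a setup, each benchmark question would be rendered once per role with `build_prompt`, the model's answers collected as records, and `bias_gap(accuracy_by_role(records))` reported per model and per MMLU subset. Because only the role line differs between prompts, any accuracy difference can be attributed to the role information.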
External IDs: dblp:conf/aiih/XiaoZPF25