Cross-Linguistic Failures and Disparities in LLM Medical Reasoning: Analyzing XMedBench and CrossMMLU Across Western and Non-Western Languages
Track: Tiny/Short Papers Track (up to 3 pages)
Keywords: Artificial Intelligence, Large Language Models, Language Disparity
TL;DR: LLMs exhibit substantial and systematic declines in clinical reasoning performance in low-resource languages compared to high-resource languages, challenging assumptions about the reliability and equity of medical AI.
Abstract: As Large Language Models (LLMs) are increasingly considered for clinical decision support, their cross-lingual reliability remains uncertain. Strong performance in Western languages is often assumed to generalize, yet this assumption has rarely been tested in non-Western languages, where medical AI could help alleviate ongoing healthcare inequalities. To investigate these disparities, we evaluated GPT-4o, two Anthropic models (Claude 3.5 Sonnet and Claude 4.5 Sonnet), and two Gemini models (Gemini 1.5 Pro and Gemini 2.0 Flash) on two multilingual benchmarks: XMedBench, which assesses clinical reasoning in six languages, and CrossMMLU, which tests a broad range of knowledge across seven languages. All models were run with a unified multiple-choice prompting technique and deterministic decoding to enable controlled comparisons. We find substantial cross-language variability. Several mid- and low-resource languages showed performance declines, especially on questions requiring spatial anatomical understanding or culturally specific terminology. English did not consistently yield the highest performance. This variability appears to be dataset-dependent, as observed in the XMedBench results, suggesting that disparities may arise from model training biases and domain limitations. Across datasets, consistent weaknesses emerged in embryology-related questions. Overall, our findings show that multilingual clinical question answering remains uneven, emphasizing the need for linguistically inclusive evaluation and mitigation strategies.
Submission Number: 22