Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Keywords: Medical knowledge evaluation, Large Language Models, low-resource dataset
Abstract: Large language models (LLMs) have shown strong performance on medical question-answering benchmarks, yet such evaluations often rely on single-choice formats that may overestimate true clinical competence.
Moreover, medical LLM performance varies substantially across languages, highlighting the importance of evaluations grounded in local medical practice and linguistic context.
In this work, we reassess LLM performance on Polish medical examinations by extending and refining an existing benchmark based on Polish Medical Exams.
We broaden the evaluation scope by incorporating questions from additional professional medical exams and by modifying question structures to create a more challenging and informative evaluation setting.
Through these extensions, we examine how evaluation design and question formulation influence model performance across diverse medical domains in Polish.
Our results provide deeper insights into the robustness and limitations of LLMs in non-English medical contexts and highlight the need for more nuanced evaluation frameworks in medical NLP.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: Generation, Question Answering, Resources and Evaluation, Multilingualism and Cross-Lingual NLP, NLP Applications, Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Polish, English
Submission Number: 2282