Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?
Keywords: Medical knowledge evaluation, Large Language Models, low-resource dataset
Abstract: Large language models (LLMs) have shown strong performance on medical question-answering benchmarks, yet such evaluations often rely on single-choice formats that may overestimate true clinical competence.
Moreover, medical LLM performance varies substantially across languages, highlighting the importance of evaluations grounded in local medical practice and linguistic context.
In this work, we reassess LLM performance on Polish medical examinations by extending and refining an existing benchmark based on Polish Medical Exams.
We broaden the evaluation scope by incorporating questions from additional professional medical exams and by modifying question structures to create a more challenging and informative evaluation setting.
Through these extensions, we examine how evaluation design and question formulation influence model performance across diverse medical domains in Polish.
Our results provide deeper insights into the robustness and limitations of LLMs in non-English medical contexts and highlight the need for more nuanced evaluation frameworks in medical NLP.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: Generation, Question Answering, Resources and Evaluation, Multilingualism and Cross-Lingual NLP, NLP Applications, Ethics, Bias, and Fairness
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Data resources, Data analysis
Languages Studied: Polish, English
Submission Number: 2282