Abstract: In recent years, Large Language Models (LLMs) have gained recognition for their ability to encode clinical knowledge within their parameters. Despite their growing popularity, the existing literature lacks a comprehensive, standardized benchmark for evaluating the performance of these models on clinical knowledge tasks. To address this gap, we introduce \qalm{}, a novel benchmark designed to harmonize the evaluation of language models in the context of clinical knowledge. The benchmark comprises 16 Multiple-Choice Question (MCQ) datasets and 6 Abstractive Question Answering (AQA) datasets, offering a diverse range of challenges for a comprehensive assessment of model capabilities.
Our experiments yield two notable findings. First, decoder-only language models may not be the optimal choice for MCQs in clinical knowledge tasks. Second, instruction fine-tuned language models do not necessarily outperform their base counterparts in these evaluations, underscoring the importance of carefully tailored model selection.
To foster research and collaboration in this field, we make our benchmark publicly available and open-source the associated evaluation scripts. This initiative aims to facilitate further advancements in clinical knowledge representation and utilization within language models, ultimately benefiting the healthcare and natural language processing communities.
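As an illustration of the MCQ setting referenced above, the snippet below is a minimal, hypothetical sketch of how a decoder-only language model can be scored on a multiple-choice question by comparing the average log-likelihood of each answer option. It is not the benchmark's released evaluation script; the checkpoint name, prompt format, and example question are placeholder assumptions.

```python
# Minimal, hypothetical sketch of MCQ scoring with a decoder-only LM
# (not the paper's released evaluation script). Each option is appended to
# the question, and the option with the highest average token log-likelihood
# is taken as the model's answer. Checkpoint and question are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(question: str, option: str) -> float:
    """Average log-probability of the option tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits             # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
    targets = full_ids[:, 1:]                       # tokens predicted from context
    token_lp = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    option_len = full_ids.shape[1] - prompt_ids.shape[1]
    return token_lp[0, -option_len:].mean().item()  # score only the option span

question = "Which vitamin deficiency causes scurvy?"  # placeholder item
options = ["Vitamin A", "Vitamin B12", "Vitamin C", "Vitamin D"]
scores = {opt: option_logprob(question, opt) for opt in options}
print(max(scores, key=scores.get))                  # predicted option
```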
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Consent To Share Submission Details: On behalf of all authors, we agree to the terms above to share our submission details.