M-QALM: A Benchmark to Assess Clinical Knowledge Recall in Language Models via Question Answering

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We introduce a novel benchmark to evaluate Large Language Models (LLMs) in the medical domain.
Abstract: In recent years, Large Language Models (LLMs) have gained recognition for their ability to encode knowledge within their parameters. Despite their growing popularity, the existing literature lacks a comprehensive, standardized benchmark for evaluating these models on clinical and biomedical knowledge. To address this gap, we introduce M-QALM, a benchmark designed to unify the evaluation of language models in such contexts. It comprises 16 Multiple-Choice Question (MCQ) datasets and 6 Abstractive Question Answering (AQA) datasets, offering a diverse range of challenges for assessing model capabilities. Our experiments show that encoder-decoder and decoder-only language models have differing strengths and weaknesses across question categories in biomedical and clinical knowledge MCQA. We further find that instruction fine-tuned language models outperform their base counterparts on these evaluations, underscoring the importance of careful model selection. To foster research and collaboration in this field, we publicly release the benchmark and open-source the associated evaluation scripts, aiming to facilitate further advances in clinical knowledge representation and utilization within language models, ultimately benefiting the healthcare and natural language processing communities.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English