Understanding Faithfulness and Reasoning of Large Language Models on Plain Biomedical Summaries

ACL ARR 2024 June Submission4273 Authors

16 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Generating plain biomedical summaries with Large Language Models (LLMs) can make biomedical knowledge more accessible to the public. However, how faithful the generated summaries are remains an open yet critical question. To address this, we propose FaReBio, a benchmark dataset with expert annotations of Faithfulness and Reasoning on plain Biomedical summaries. The dataset consists of 175 plain summaries (1,445 sentences) generated by 7 different LLMs, paired with PubMed articles. Based on our dataset, we identify the performance gap of LLMs in generating faithful plain biomedical summaries and show the impact of abstractiveness on faithfulness. We also show that current faithfulness metrics do not transfer well to the biomedical domain. To better understand faithfulness judgements, we further benchmark LLMs on retrieving supporting evidence. Going beyond binary faithfulness labels, coupled with the annotation of supporting sentences, our dataset can further contribute to the understanding of faithfulness evaluation and reasoning.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: corpus creation; benchmarking; language resources; NLP datasets
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 4273