Evaluating ChatGPT's Performance in Generating and Assessing Dutch Radiology Report Impressions

Published: 06 Jun 2024 · Last Modified: 02 Jul 2024 · MIDL 2024 Poster · CC BY 4.0
Keywords: Large Language Models, ChatGPT, Radiology Reports, Impression Generation, Fine-tuning
Abstract: The integration of Large Language Models (LLMs), such as ChatGPT, in radiology could offer insight and interpretation for the increasing number of radiological findings generated by Artificial Intelligence (AI). However, the complexity of medical text presents many challenges for LLMs, particularly in uncommon languages such as Dutch. This study therefore aims to evaluate ChatGPT's ability to generate accurate 'Impression' sections of radiology reports, and its effectiveness in evaluating these sections compared against human radiologist judgments. We utilized a dataset of CT-thorax radiology reports to fine-tune ChatGPT and then conducted a reader study with two radiologists and GPT-4 out-of-the-box to evaluate the AI-generated 'Impression' sections in comparison to the originals. The results revealed that human experts rated original impressions higher than AI-generated ones across correctness, completeness, and conciseness, highlighting a gap in the AI's ability to generate clinically reliable medical text. Additionally, GPT-4's evaluations were more favorable towards AI-generated content, indicating limitations in its out-of-the-box use as an evaluator in specialized domains. The study emphasizes the need for cautious integration of LLMs into medical domains and the importance of expert validation, yet also acknowledges the inherent subjectivity in interpreting and evaluating medical reports.
Submission Number: 200