Keywords: Large Language Models, ChatGPT, Radiology Reports, Impression Generation, Fine-tuning
Abstract: The integration of Large Language Models (LLMs), such as ChatGPT, in radiology could offer insight into and interpretation of the increasing number of radiological findings generated by Artificial Intelligence (AI). However, the complexity of medical text presents many challenges for LLMs, particularly in less widely used languages such as Dutch. This study therefore aims to evaluate ChatGPT's ability to generate accurate 'Impression' sections of radiology reports, and its effectiveness in evaluating these sections compared against human radiologist judgments. We used a dataset of CT-thorax radiology reports to fine-tune ChatGPT and then conducted a reader study in which two radiologists and an out-of-the-box GPT-4 evaluated the AI-generated 'Impression' sections against the originals. The results revealed that human experts rated the original impressions higher than the AI-generated ones on correctness, completeness, and conciseness, highlighting a gap in the AI's ability to generate clinically reliable medical text. Additionally, GPT-4's evaluations were more favorable towards AI-generated content, indicating limitations in its out-of-the-box use as an evaluator in specialized domains. The study emphasizes the need for cautious integration of LLMs into medical domains and the importance of expert validation, yet also acknowledges the inherent subjectivity in interpreting and evaluating medical reports.
Submission Number: 200