GREEN: Generative Radiology Report Evaluation and Error Notation

ACL ARR 2024 April Submission 350 Authors

15 Apr 2024 (modified: 22 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Evaluating radiology reports is challenging because, given their medical nature, factual correctness is paramount. Existing automatic evaluation metrics either fail to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human-interpretable explanations of clinically significant errors, enabling feedback loops with end users, and 3) a lightweight open-source method that matches the performance of commercial counterparts. We validate GREEN by comparing it to GPT-4, as well as to error counts from 6 experts and preferences of 2 experts. Our method demonstrates not only higher correlation with expert error counts but also higher alignment with expert preferences than previous approaches.
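To make the idea of an LLM-based, error-counting report metric concrete, below is a minimal Python sketch of a GREEN-style evaluation loop. It is not the authors' released implementation: the model checkpoint name, prompt wording, output format, and the mapping from error count to a bounded score are all illustrative assumptions.

```python
# Hypothetical GREEN-style scoring sketch (not the authors' implementation).
# Assumptions: the checkpoint name is a placeholder, the prompt/output format
# is invented for illustration, and the error-count-to-score mapping is ours.
import re
from transformers import pipeline

# Any instruction-tuned causal LM could stand in here; this name is a placeholder.
generator = pipeline("text-generation", model="your-org/green-style-llm")

PROMPT = (
    "Compare the candidate radiology report to the reference report. "
    "List each clinically significant error with a brief explanation, "
    "then end with a line 'Significant errors: <count>'.\n\n"
    "Reference report:\n{ref}\n\nCandidate report:\n{cand}\n"
)

def green_style_score(reference: str, candidate: str) -> float:
    """Return a score in (0, 1]; 1.0 means no clinically significant errors found."""
    output = generator(
        PROMPT.format(ref=reference, cand=candidate),
        max_new_tokens=512,
        do_sample=False,
    )[0]["generated_text"]
    match = re.search(r"Significant errors:\s*(\d+)", output)
    errors = int(match.group(1)) if match else 0
    # Simple monotone mapping from error count to a bounded score (assumption).
    return 1.0 / (1.0 + errors)
```

The free-text error list produced before the final count is what gives such a metric its interpretability: it can be surfaced to radiologists as feedback, while the parsed count yields the quantitative score.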
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: automatic evaluation, automatic evaluation of datasets, evaluation methodologies, evaluation, metrics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 350