A Radiology-Aware Model-Based Evaluation Metric for Report Generation

ACL ARR 2024 June Submission 361 Authors

10 Jun 2024 (modified: 02 Aug 2024) · CC BY 4.0
Abstract: We propose a novel automated evaluation metric for machine-generated radiology reports, adapting the successful COMET architecture to the radiology domain. We train and publish four medically oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric correlates moderately to highly with established metrics such as BERTScore, BLEU, and CheXbert scores. In addition, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed against publicly available annotations from six board-certified radiologists on a set of 200 reports. We also conducted our own analysis, gathering annotations from two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. Code, data, and model checkpoints to reproduce our findings will be publicly available.
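The abstract reports correlations between metric scores and radiologist judgments. As a minimal illustration of how such a validation step can be computed, the sketch below correlates a list of automatic metric scores with human ratings; all scores and ratings shown are hypothetical toy values, not data from the paper.

```python
# Minimal sketch: correlating automatic metric scores with human judgments,
# as done when validating a learned evaluation metric against annotations.
# All numeric values below are hypothetical, for illustration only.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-report metric scores and radiologist ratings (1-5 scale).
metric_scores = [0.82, 0.55, 0.91, 0.40, 0.73]
human_ratings = [4.0, 3.0, 5.0, 2.0, 4.0]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")  # → Pearson r = 0.982 for these toy values
```

In practice, rank correlations such as Kendall's tau are also commonly reported for metric-to-human agreement, since they are robust to monotone rescaling of the scores.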
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Evaluation Metrics, Report Generation, Radiology dataset
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 361