A Radiology-Aware Model-Based Evaluation Metric for Report Generation

ACL ARR 2024 June Submission 361 Authors

10 Jun 2024 (modified: 02 Aug 2024) · CC BY 4.0
Abstract: We propose a novel automated evaluation metric for machine-generated radiology reports, adapting the successful COMET architecture to the radiology domain. We train and publish four medically oriented model checkpoints, including one trained on RadGraph, a radiology knowledge graph. Our results show that our metric correlates moderately to highly with established metrics such as BERTScore, BLEU, and CheXbert scores. In addition, we demonstrate that one of our checkpoints exhibits a high correlation with human judgment, as assessed against publicly available annotations from six board-certified radiologists on a set of 200 reports. We also conducted our own analysis, gathering annotations from two radiologists on a collection of 100 reports. The results indicate the potential effectiveness of our method as a radiology-specific evaluation metric. Code, data, and model checkpoints to reproduce our findings will be publicly available.
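The abstract reports correlations between metric scores and radiologist judgments. As a minimal illustration of how such a validation step can be computed, the sketch below correlates a list of automatic metric scores with human ratings; all scores and ratings shown are hypothetical toy values, not data from the paper.

```python
# Minimal sketch: correlating automatic metric scores with human judgments,
# as done when validating a learned evaluation metric against annotations.
# All numeric values below are hypothetical, for illustration only.

from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-report metric scores and radiologist ratings (1-5 scale).
metric_scores = [0.82, 0.55, 0.91, 0.40, 0.73]
human_ratings = [4.0, 3.0, 5.0, 2.0, 4.0]

r = pearson(metric_scores, human_ratings)
print(f"Pearson r = {r:.3f}")  # → Pearson r = 0.982 for these toy values
```

In practice, rank correlations such as Kendall's tau are also commonly reported for metric-to-human agreement, since they are robust to monotone rescaling of the scores.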
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Evaluation Metrics, Report Generation, Radiology dataset
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 361