ER2Score: An Explainable and Customizable Metric for Assessing Radiology Reports with LLM-based Rewards

25 Sept 2024 (modified: 15 Nov 2024) · ICLR 2025 Conference Withdrawn Submission · CC BY 4.0
Keywords: Radiology Report Generation, Auto Evaluation Metrics, Reward Model, LLM, RLHF
Abstract: In recent years, automated radiology report generation (R2Gen) has advanced considerably, introducing new evaluation challenges due to the task's complex nature. Traditional metrics often evaluate reports inaccurately because they rely on rigid word-matching techniques or focus exclusively on pathological entities, leading to inconsistencies with human assessments. To bridge this gap, we introduce ER2Score, an automatic evaluation metric designed specifically for R2Gen that harnesses the capabilities of Large Language Models (LLMs). Our metric leverages a reward model and a tailored design for its training data, allowing the evaluation criteria to be customized to user-defined needs. It not only scores reports according to user-specified criteria but also provides detailed sub-scores, enhancing interpretability and allowing users to balance the criteria between the clinical and linguistic aspects of reports. Using GPT-4, we generate extensive training data under two different scoring systems, comprising reports of varying quality together with their corresponding scores. These GPT-generated reports are then paired as accepted and rejected samples to train an LLM-based reward model that assigns higher rewards to higher-quality reports. Our proposed loss function enables the model to simultaneously output multiple individual rewards, one per evaluation criterion, whose sum forms the final ER2Score. Our experiments demonstrate ER2Score's stronger correlation with human judgments and superior performance in model selection compared to traditional metrics. Notably, the model's ability to provide not only a single overall score but also a score for each individual evaluation item enhances the interpretability of the assessment results. We also showcase the flexibility of training our model under different evaluation systems. We will release the code on GitHub.
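For a concrete picture of the training objective the abstract describes, below is a minimal sketch of a multi-criterion pairwise reward loss. The abstract only states that the model outputs one reward per evaluation criterion, that accepted/rejected report pairs are used for training, and that the sub-rewards sum to the final ER2Score; the per-criterion Bradley-Terry form, the function names, and the tensor shapes here are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def multi_criterion_pairwise_loss(r_accepted: torch.Tensor,
                                  r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss over K evaluation criteria.

    r_accepted, r_rejected: shape (batch, K) -- per-criterion rewards
    predicted by the model for the higher- and lower-quality report in
    each GPT-4-scored pair. (Hypothetical: the paper's exact loss is not
    given in the abstract; this follows the standard Bradley-Terry form
    used in RLHF reward modeling, applied independently per criterion.)
    """
    # Encourage each sub-reward of the accepted report to exceed the
    # corresponding sub-reward of the rejected report.
    per_criterion = -F.logsigmoid(r_accepted - r_rejected)  # (batch, K)
    return per_criterion.mean()

def er2score(sub_rewards: torch.Tensor) -> torch.Tensor:
    # Per the abstract, the final ER2Score is the sum of the
    # individual per-criterion sub-scores.
    return sub_rewards.sum(dim=-1)
```

Applying the loss per criterion, rather than only to the summed rewards, is one way to make each sub-score individually meaningful, which would support the interpretability claims in the abstract; the authors' actual formulation may differ.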
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5247