Abstract: We propose MRScore, an automatic evaluation metric specifically tailored to radiology report generation. Traditional natural language generation (NLG) metrics such as BLEU are inadequate for accurately assessing reports, particularly those generated by Large Language Models (LLMs); our experiments in this paper provide systematic evidence of these inadequacies. To address this challenge, we developed a framework for guiding LLMs in evaluating radiology reports, designed in collaboration with radiologists and aligned with standard human report evaluation procedures. Used as a prompt, this framework ensures that the LLMs' output closely mirrors human analysis. We then used LLM-generated data to construct a human-labeled dataset of paired accepted and rejected samples, and trained the MRScore model as a reward model on this dataset. MRScore demonstrates a higher correlation with human judgments and superior performance in model selection compared with traditional metrics. Our code is available on GitHub at: https://github.com/yunyiliu/MRScore.
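The abstract describes training MRScore as a reward model on accepted/rejected report pairs. Below is a minimal sketch of how such pairwise reward-model training typically works, assuming a Bradley-Terry-style ranking loss and a hypothetical scoring head over pooled report embeddings; the actual architecture and encoder used by MRScore may differ (see the linked repository).

```python
# Minimal sketch of reward-model training on accept/reject report pairs.
# Assumptions (not confirmed by the abstract): a Bradley-Terry-style pairwise
# loss and a hypothetical linear scoring head over report embeddings.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        # Hypothetical scoring head mapping a pooled report embedding to a scalar.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, report_embedding: torch.Tensor) -> torch.Tensor:
        # Returns one scalar reward per report in the batch.
        return self.head(report_embedding).squeeze(-1)

def pairwise_loss(model: RewardModel,
                  accepted: torch.Tensor,
                  rejected: torch.Tensor) -> torch.Tensor:
    # Encourage the accepted report to score above the rejected one:
    # loss = -log(sigmoid(r_accept - r_reject)).
    r_acc = model(accepted)
    r_rej = model(rejected)
    return -torch.nn.functional.logsigmoid(r_acc - r_rej).mean()

# Toy usage: random embeddings stand in for encoded radiology reports.
model = RewardModel()
accepted = torch.randn(4, 768)   # embeddings of human-preferred reports
rejected = torch.randn(4, 768)   # embeddings of rejected reports
loss = pairwise_loss(model, accepted, rejected)
loss.backward()
```

This pairwise objective is the standard formulation for reward models trained from preference data; at inference time, the scalar output can be read directly as the quality score for a generated report.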