Abstract: Automated text generation in the medical domain is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially where domain-specific metrics are lacking, e.g. in histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity- and relation-centric framework composed of a benchmark dataset, an NER model, an RE model, and a novel metric that prioritizes clinically relevant content by aligning critical histopathology entities and relations between reference and generated reports. To develop the HARE benchmark, we annotated 854 de-identified diagnostic histopathology reports from a hospital and 652 reports from The Cancer Genome Atlas (TCGA) with domain-specific entities and relations. We fine-tuned GatorTronS, a domain-adapted language model, to develop HARE-NER and HARE-RE, which achieved the highest NER F1-score (0.865) and the highest RE F1-score (0.988) among the tested models. The proposed HARE metric outperformed traditional metrics including ROUGE and Meteor, as well as the radiology metrics RaTEScore and RadGraph-XL, achieving the highest correlation and the best regression to expert evaluations (exceeding the second-best method, GREEN, a large-language-model-based radiology report evaluator, by Pearson $r = 0.212$, Spearman $\rho = 0.189$, Kendall $\tau = 0.151$, $R^2 = 0.23$, $RMSE = 0.024$). We will release HARE, the datasets, and the models to foster advancements in histopathology report generation, providing a robust framework for improving the quality of reports.
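The abstract describes HARE as aligning extracted entities and relations between a reference and a generated report. The exact HARE formula is not given here, so the following is only a minimal illustrative sketch: an F1-style overlap score over sets of annotated items, where matching on exact (span text, label) tuples and the example entity labels (`DIAGNOSIS`, `GRADE`) are assumptions, not the paper's actual matching rule or schema.

```python
# Illustrative sketch of an entity/relation-overlap score in the spirit of HARE.
# NOTE: exact-tuple matching and the labels below are assumptions for the demo;
# the real HARE metric is defined in the paper, not here.

def overlap_f1(reference, generated):
    """F1 over sets of annotated items (entities or relations)."""
    ref, gen = set(reference), set(generated)
    if not ref and not gen:
        return 1.0  # both reports empty: trivially perfect agreement
    matched = len(ref & gen)
    precision = matched / len(gen) if gen else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical annotations: entities as (span text, label) pairs.
ref_entities = {("invasive ductal carcinoma", "DIAGNOSIS"), ("grade 2", "GRADE")}
gen_entities = {("invasive ductal carcinoma", "DIAGNOSIS"), ("grade 3", "GRADE")}
score = overlap_f1(ref_entities, gen_entities)  # 1 of 2 items matched -> 0.5
```

The same function applies unchanged to relation tuples such as (head, relation type, tail), which is what makes an entity- and relation-centric metric sensitive to clinical content rather than surface wording.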
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation, Generation, Interpretability and Analysis of Models for NLP, Information Extraction
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 4159