HARE: An Entity-Centric Evaluation Framework for Histopathology Reports

ACL ARR 2025 February Submission 2438 Authors

14 Feb 2025 (modified: 09 May 2025) · CC BY 4.0
Abstract: Automated text generation in the medical domain is an active area of research and development; however, evaluating the clinical quality of generated reports remains a challenge, especially where domain-specific metrics are lacking, e.g., in histopathology. We propose HARE (Histopathology Automated Report Evaluation), a novel entity-centric framework composed of a benchmark dataset, an NER model, and a novel metric that prioritizes clinically relevant content by aligning critical histopathology entities between reference and generated reports. To develop the HARE benchmark, we curated a gold dataset of 1,196 de-identified diagnostic histopathology reports annotated with domain-specific entities and a silver dataset of 1,830 automatically annotated reports from The Cancer Genome Atlas (TCGA). We fine-tuned GatorTronS, a domain-adapted language model, to develop HARE-NER, which achieved the highest F1-score (0.812) among the NER models tested. The proposed HARE metric outperformed traditional metrics, including ROUGE and METEOR, as well as the radiology metrics RaTEScore and RadGraph-XL, achieving the highest correlation with expert evaluations (exceeding the second-best method, RadGraph-XL, by Pearson $r = 0.061$, Spearman $\rho = 0.048$, and Kendall $\tau = 0.066$). We will release HARE, the datasets, and the NER model to foster advancements in histopathology report generation, providing a robust framework for improving the quality of histopathology reports.
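The abstract describes scoring a generated report by aligning extracted histopathology entities against those of the reference report. HARE's exact formulation is not given here, so the following is only a minimal illustrative sketch: a simple entity-level F1 over entity sets, where entity extraction itself would be performed upstream by an NER model (e.g., HARE-NER); the function name and matching rule (exact, case-insensitive match) are assumptions for illustration.

```python
def entity_f1(reference_entities, generated_entities):
    """Harmonic mean of entity-level precision and recall.

    Both arguments are lists of entity strings extracted from a
    reference and a generated report, respectively. Matching here is
    exact and case-insensitive; a real metric would likely use
    type-aware or fuzzy alignment.
    """
    ref = {e.lower() for e in reference_entities}
    gen = {e.lower() for e in generated_entities}
    if not ref and not gen:
        return 1.0  # both reports contain no entities: trivially aligned
    matched = ref & gen
    precision = len(matched) / len(gen) if gen else 0.0
    recall = len(matched) / len(ref) if ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: one of two reference entities is reproduced,
# and the generated report adds an unsupported finding.
score = entity_f1(
    ["invasive ductal carcinoma", "margins negative"],
    ["invasive ductal carcinoma", "lymph node metastasis"],
)
```

A set-based F1 like this rewards clinically relevant content overlap rather than surface n-gram overlap, which is the motivation the abstract gives for preferring an entity-centric metric over ROUGE or METEOR.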
Paper Type: Long
Research Area: Generation
Research Area Keywords: Generation, Interpretability and Analysis of Models for NLP, Resources and Evaluation
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 2438