Evaluating the Evaluators: On the Reliability of Automated Label Extraction for Radiology Reports

15 Apr 2026 (modified: 16 Apr 2026) · MIDL 2026 Short Papers Submission · CC BY 4.0
Keywords: Label Extraction, Radiology Reports, Large Language Models, Evaluation
TL;DR: Comparing LLMs, a standard classifier, and expert-adjudicated ground truth to assess the reliability of automated label extraction from radiology reports.
Abstract: Automated extraction of structured clinical labels from radiology reports is a key evaluation method for vision-language models in medical imaging, yet the reliability of these labels is rarely questioned. We analyze agreement between multiple large language models (LLMs) and the labels of the CT-RATE dataset. Through manual review of instances with inter-method disagreement, we identify errors in the CT-RATE labels and manually correct the annotations. LLM-based annotators exhibit high labeling fidelity, while the RadBERT classifier used to create the official CT-RATE labels shows higher error rates and degrades under distribution shift; the best LLM reduces CT-RATE’s “ground truth” labeling errors by 60%. Beyond identifying the limitations of current labeling pipelines, we propose a procedure for extracting reliable reference annotations and publish our refined CT-RATE labels at: https://zenodo.org/records/19597002.
Submission Number: 91