Evaluating the Evaluators: On the Reliability of Automated Label Extraction for Radiology Reports

15 Apr 2026 (modified: 16 Apr 2026) · MIDL 2026 Short Papers Submission · CC BY 4.0
Keywords: Label Extraction, Radiology Reports, Large Language Models, Evaluation
TL;DR: Comparing LLMs, a standard classifier, and expert-adjudicated ground truth to assess the reliability of automated label extraction from radiology reports.
Abstract: Automated extraction of structured clinical labels from radiology reports is a key evaluation method for vision-language models in medical imaging, yet the reliability of these labels is rarely questioned. We analyze agreement between multiple large language models (LLMs) and the labels of the CT-RATE dataset. Through manual review of instances with inter-method disagreement, we identify errors in the CT-RATE labels and manually correct the annotations. LLM-based annotators exhibit high labeling fidelity, while the RadBERT classifier used to create the official CT-RATE labels shows higher error rates and degrades under distribution shift; the best LLM reduces CT-RATE’s “ground truth” labeling errors by 60%. Beyond identifying the limitations of current labeling pipelines, we propose a procedure for extracting reliable reference annotations and publish our refined CT-RATE labels at: https://zenodo.org/records/19597002.
Submission Number: 91