Keywords: Label Extraction, Radiology Reports, Large Language Models, Evaluation
TL;DR: Comparing LLMs, a standard classifier, and expert-adjudicated ground truth to assess the reliability of automated label extraction from radiology reports.
Registration Requirement: Yes
Abstract: Automated extraction of structured clinical labels from radiology reports is a key evaluation
method for vision-language models in medical imaging, yet the reliability of these labels is
rarely questioned. We analyze agreement between multiple large language models (LLMs)
and the labels from the CT-RATE dataset. Through manual review of instances with
inter-method disagreement, we identified errors in the CT-RATE labels and
manually corrected the affected annotations. Furthermore, LLM-based annotators exhibit high labeling
fidelity, while the RadBERT classifier, used to create the official labels for CT-RATE,
shows higher error rates and degrades under distribution shift; specifically, the best LLM
achieves a 60% reduction of CT-RATE’s “ground truth” labeling errors. Beyond identifying
the limitations of the current labels, we provide a method for extracting reliable reference
annotations and publish our refined CT-RATE labels at: https://zenodo.org/records/19597002.
Visa & Travel: No
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 91