Learning from Textual Radiology Reports: A Benchmark Dataset for Coronary CT Angiography

Sudharshan Balaji; Zhiyu Liu; Zhengyuan Jiang; Shuo Lei; Yimin Chen; Yang Xiao; Shone O. Almeida; Mathew Joseph Karivelil; Christopher Malanga; Ning Wang

Published: 18 Apr 2026, Last Modified: 24 Apr 2026

Venue: ACL 2026 Industry Track Poster

License: CC BY 4.0

Keywords: Clinical NLP, Text Classification, Information Extraction, Medical Informatics, Language Models, Dataset Creation, Explainable AI, Clinical Decision Support, NLP Pipeline

TL;DR: We introduce CCTA-RADS, a new large-scale public dataset of clinical radiology reports, and propose a novel two-stage pipeline that robustly predicts disease scores from heterogeneous text, significantly outperforming direct classification approaches.

Abstract: While coronary imaging is widely used for anatomical assessment, coronary CT angiography (CCTA) reports play a distinct last-mile role in clinical care. Rather than serving as an intermediate signal, a CCTA report provides an assessment of coronary disease severity (the CAD-RADS score) that guides patient management. However, real-world clinical text varies substantially in terminology and structure, so automated systems interpret it inconsistently, even for clinically similar cases. Recent work applies LLMs directly to automated CAD-RADS scoring but is limited by small, non-public, and homogeneous clinical data. We introduce CCTA-RADS, the largest publicly available dataset of its kind, comprising 940 real-world CCTA reports from a major cardiovascular center, each annotated with a CAD-RADS score. Our analysis reveals that direct approaches, including state-of-the-art LLMs (GPT-4o, GPT-o3) and fine-tuned BERT models, underperform on diverse real-world clinical data. To address these limitations, we propose a two-stage pipeline that decouples structuring from classification: an LLM-based parser normalizes heterogeneous reports into a structured format, and a fine-tuned BERT model then classifies the structured output. This approach substantially improves the F1-score by 6%-13% compared with direct methods. We deploy our system as an interactive web interface that allows clinicians to upload CCTA reports for automated CAD-RADS assessment with SHAP and LIME explainability visualizations.
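
The decoupling of structuring from classification described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `TwoStagePipeline`, the schema (vessel name mapped to stenosis percentage), and both stand-in functions are hypothetical. `toy_parse` is a regex placeholder for the LLM-based parser, and `toy_classify` is a threshold rule placeholder for the fine-tuned BERT classifier, using the standard CAD-RADS stenosis ranges in simplified form (no 4A/4B split, no modifiers).

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical structured schema: vessel name -> stenosis description.
StructuredReport = Dict[str, str]


@dataclass
class TwoStagePipeline:
    """Stage 1 normalizes free text; stage 2 classifies the normalized form."""
    parse: Callable[[str], StructuredReport]   # stand-in for the LLM parser
    classify: Callable[[str], int]             # stand-in for the BERT classifier

    @staticmethod
    def serialize(structured: StructuredReport) -> str:
        # Fixed key order gives the classifier a consistent input layout.
        return " | ".join(f"{k}: {v}" for k, v in sorted(structured.items()))

    def predict(self, report_text: str) -> int:
        structured = self.parse(report_text)
        return self.classify(self.serialize(structured))


def toy_parse(report_text: str) -> StructuredReport:
    """Assumption: a regex replaces the LLM call so the sketch runs offline."""
    pattern = re.compile(r"\b(LM|LAD|LCX|RCA)\b[^\d.]*?(\d{1,3})\s*%", re.IGNORECASE)
    return {vessel.upper(): f"{pct}%" for vessel, pct in pattern.findall(report_text)}


def toy_classify(structured_text: str) -> int:
    """Assumption: a threshold rule replaces the fine-tuned BERT model."""
    values = [int(p) for p in re.findall(r"(\d{1,3})%", structured_text)]
    worst = max(values, default=0)
    if worst == 0:
        return 0
    if worst < 25:
        return 1
    if worst < 50:
        return 2
    if worst < 70:
        return 3
    if worst < 100:
        return 4
    return 5


pipeline = TwoStagePipeline(parse=toy_parse, classify=toy_classify)
report = "LAD: 60% stenosis. RCA shows 30 % narrowing."
print(pipeline.serialize(toy_parse(report)))
print(pipeline.predict(report))
```

Because both stages are passed in as plain callables, either stand-in can be replaced by a real model call without touching the other, which is the point of separating the two steps.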

Submission Type: Emerging

Submission Number: 115
