How Well Can Modern LLMs Act as Agent Cores in Radiology Environments?

ACL ARR 2026 January Submission9888 Authors

06 Jan 2026 (modified: 20 Mar 2026) · CC BY 4.0
Keywords: radiology, agentic system, large language models
Abstract: Large language models (LLMs) hold promise for building accurate and interpretable agentic systems in complex domains like radiology. To evaluate whether modern LLMs can serve as agent cores in radiology settings, we introduce \textbf{RadA-BenchPlat}, a comprehensive platform built on 2,200 patient records spanning 6 anatomical regions, 5 imaging modalities, and 2,200 diseases. The dataset includes 24,200 QA pairs and 10 tool categories for radiology task-solving. Our benchmarking of 7 leading LLMs reveals significant gaps: while models such as Claude-3.7-Sonnet achieve 67.1\% task completion in routine scenarios, they struggle with complex reasoning and tool coordination. We then apply prompt engineering strategies, yielding an overall 48.2\% performance gain (\(p < 0.001\)) on complex tasks-with \textbf{prompt backpropagation} and \textbf{multi-agent collaboration} contributing 16.8\% (\(p < 0.01\)) and 30.7\% (\(p < 0.001\)) improvements, respectively. We further enhance robustness via automated tool building, reaching 65.4\% success. Our work provides critical benchmarks and actionable strategies for developing reliable radiology AI agents, moving closer to fully automated clinical applications. Code and data are prepared and will be available upon publication.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: clinical decision support, Clinical and biomedical language models
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 9888