\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\jmlrvolume{-- 327}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\usepackage{booktabs}
\usepackage{siunitx}
\usepackage[switch]{lineno}
% \usepackage{subcaption}

% for prompt template
\usepackage[framemethod=tikz]{mdframed}
\newmdenv[
  backgroundcolor=gray!10,
  linecolor=black!30,
  linewidth=0.4pt,
  innerleftmargin=6pt,innerrightmargin=6pt,
  innertopmargin=6pt,innerbottommargin=6pt,
  skipabove=\baselineskip, skipbelow=\baselineskip
]{infobox}

\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{algorithm2e}
\usepackage{enumitem}
\usepackage{multirow}
\usepackage{siunitx,booktabs,array}%,tabularx}
\newcolumntype{M}{S[table-format=1.4]}  % same width for all metric columns

\usepackage{xcolor}
\usepackage[most]{tcolorbox}
\usepackage{fontawesome5} 

% === Custom Colors ===
\definecolor{cmdblue}{RGB}{0, 80, 180}   
\definecolor{errred}{RGB}{200, 40, 40}   
\definecolor{successgreen}{RGB}{40, 160, 40} 
\definecolor{loggray}{RGB}{100, 100, 100} 
\definecolor{alertorange}{RGB}{230, 120, 0}

% === Trace Box Environment ===
\newtcolorbox{tracebox}[1][]{
    colback=white,
    colframe=gray!50,
    boxrule=1pt,
    arc=4mm,
    fonttitle=\bfseries\large,
    title={#1},
    sharp corners=south,
    enhanced, 
    breakable,  
    % overlay first={...}, 
    % overlay last={...},
}

\title[RadAgents]{RadAgents: Multimodal Agentic Reasoning for Chest X-ray Interpretation with Radiologist-like Workflows}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Kai Zhang\midljointauthortext{Internship at Oracle Health AI.}\nametag{$^{1,2}$}} \orcid{0000-0002-6322-6096} \Email{kaz321@lehigh.edu}\\
\Name{Corey D Barrett\nametag{$^{1}$}} \Email{corey.barrett@oracle.com}\\
\Name{Jangwon Kim\nametag{$^{1}$}} \Email{jangwon.kim@oracle.com}\\
\Name{Lichao Sun\nametag{$^{2}$}} \Email{lis221@lehigh.edu}\\
% \addr $^{3}$ Address 3 \AND
\Name{Tara Taghavi\nametag{$^{1}$}} \Email{tara.taghavi@oracle.com}\\
\Name{Krishnaram Kenthapadi\nametag{$^{1}$}} \Email{krishnaram.kenthapadi@oracle.com}\\
\addr $^{1}$ Oracle Health AI \\
\addr $^{2}$ Lehigh University
}

\begin{document}

\maketitle

\begin{abstract}
Agentic systems offer a potential path to solve complex clinical tasks through collaboration among specialized agents, augmented by tool use and external knowledge bases. Nevertheless, for chest X-ray (CXR) interpretation, prevailing methods remain limited: (i) reasoning is frequently neither clinically interpretable nor aligned with guidelines, reflecting mere aggregation of tool outputs; (ii) multimodal evidence is insufficiently fused, yielding text-only rationales that are not visually grounded; and (iii) systems rarely detect or resolve cross-tool inconsistencies and provide no principled verification mechanisms. 
To bridge the above gaps, we present \textbf{\textsc{RadAgents}}, a multi-agent framework that couples clinical priors with task-aware multimodal reasoning and encodes a radiologist-style workflow into a modular, auditable pipeline.
In addition, we integrate grounding and multimodal retrieval-augmentation to verify and resolve context conflicts, resulting in outputs that are more reliable, transparent, and consistent with clinical practice. 
\end{abstract}

\begin{keywords}
Multi-agent system, multimodal reasoning, chest X-ray, image interpretation.
\end{keywords}

\section{Introduction}
\label{sec:intro}

Chest X-ray (CXR) imaging is a cornerstone of pulmonary screening, diagnosis, and follow-up, accounting for the largest share of diagnostic radiology examinations performed worldwide each year~\citep{cid2024development}. Yet systematic assessment of thoracic structures remains labor-intensive, imposing a substantial time burden on radiologists~\citep{fallahpourmedrax}. The gradual introduction of AI into clinical practice shows promise for alleviating this workload~\citep{zhang2024generalist, tanno2025collaboration}. 
However, prevailing systems fall short on complex multimodal reasoning, such as integrating findings across disparate image regions, views, and time points, which is central to radiologists' practice. 
% However, prevailing systems still fall short on complex multimodal reasoning and remain misaligned with the structured, workflow-driven interpretation used by radiologists.
Most methods adhere to end-to-end designs in which the visual encoder performs a \emph{single front-end pass} and subsequent reasoning proceeds \emph{purely in text}~\citep{wang2025multimodal}. This encode-once, text-only paradigm decouples the reasoning trajectory from evolving visual evidence, leading to failures on tasks that require iterative re-inspection, precise measurements, and cross-comparisons~\citep{liu2025more}, as shown in Figure~\ref{fig:ctr_reasoning}.

A promising direction is to \emph{augment} large language models, including multimodal variants, with \emph{external tools}~\citep{lu2025octotools}. By delegating perceptual and classification subtasks such as organ or region segmentation and disease classification to validated modules, the language model can focus on planning and synthesis. Several agentic frameworks have explored this idea, ranging from training small models for limited tool use~\citep{li2024mmedagent, nath2025vila} to pipeline systems that invoke general-purpose models for more flexible operations~\citep{jiang2025medagentbench, schmidgall2024agentclinic}. In CXR interpretation, RadFabric~\citep{chen2025radfabric} integrates diagnostic agents with a separate reasoning agent, and MedRAX~\citep{fallahpourmedrax} expands task coverage by incorporating additional task-specific models. Despite improvements over single-model baselines, these systems often remain opaque and weakly aligned with clinical workflow: integration and reasoning steps are not explicitly traceable, visual and textual evidence are loosely coupled, and inconsistencies across tools are not systematically detected or resolved, which undermines trust and creates safety risks.

\begin{figure*}
    \centering
    \includegraphics[width=0.8\linewidth]{fig/vcot.png}
    \caption{%
    {Prior ``encode-once, text-only'' (left) and grounding/cropping-only variants (right) can fail on queries requiring iterative re-inspection and quantitative assessment (e.g., cardiothoracic ratio). RadAgents instead supports multimodal interleaved reasoning, where hypotheses trigger targeted visual operations and the final answer is grounded in explicit visual evidence.}}
    \label{fig:ctr_reasoning}
    
\end{figure*}

By contrast, radiologists reason through structured, radiology-specific workflows. For CXR, training and guidelines emphasize systematic review schemes, explicit quantitative assessments (e.g., cardiothoracic ratio, carinal angle), and comparison across time and views  \citep{hodler2019diseases}. This process is inherently \emph{interleaved}: clinicians move back and forth between image inspection, measurements, and textual synthesis, refining hypotheses as new evidence is obtained. Crucially, the reasoning is explicit and traceable, enabling peer review, error analysis, and integration with broader clinical context. Current multimodal LLM systems capture this paradigm only implicitly: reasoning is buried inside the model, tool calls are ad hoc, and there is limited support for auditing \emph{how} a conclusion was reached or \emph{which} intermediate steps failed \citep{lee2025cxreasonbench}.

To bridge this gap, we present \textbf{RadAgents}, a multi-agent framework for complex multimodal reasoning in CXR that encodes radiologist-like workflows into a modular, auditable pipeline. RadAgents decomposes interpretation into seven specialized agents: five subagents that implement core radiologic review modes, an \emph{Orchestrator} that analyzes queries and dispatches tasks, and a \emph{Synthesizer} that aggregates outputs, performs contextual verification, and resolves cross-tool conflicts. Each subagent follows predefined, radiologist-style workflows when applicable (e.g., for cardiomegaly or pleural effusion), while out-of-template queries fall back to workflow-free ReAct-style reasoning. Throughout, RadAgents maintains explicit logs of intermediate artifacts (segmentations, measurements, retrieved exemplars, and rationales), yielding step-level traceability instead of a single opaque explanation.

From a deployment standpoint, we instantiate RadAgents with open-source, lightweight vision--language models (Qwen3-VL-Instruct 4B/8B/30B \citep{Qwen3-VL}) as the core engines of each agent. We show that, when paired with structured workflows, tool integration, and conflict resolution, an 8B open model can match or surpass GPT-4o and specialist CXR models such as CheXagent \citep{chen2024chexagent} on diverse benchmarks. This suggests that trustworthy, workflow-aligned CXR reasoning does not strictly require very large proprietary models and can be realized with more accessible, on-premise-friendly architectures. Our contributions are threefold:
\begin{itemize}
    \item We formalize \emph{radiologist-like multimodal workflows} for CXR interpretation and encode them in a multi-agent system that interleaves visual evidence, measurements, and textual reasoning. RadAgents supports both workflow-guided and workflow-free modes, enabling coverage of common guideline-style tasks as well as open-ended queries.
    \item We design a \emph{traceable, tool-augmented agentic architecture} that combines an Orchestrator, subagents, and a Synthesizer with retrieval-augmented conflict resolution and short-term memory. This yields explicit step-by-step trajectories and principled handling of cross-tool inconsistencies, improving alignment with clinical practice.
    \item We conduct an extensive study across three challenging multimodal medical-reasoning datasets. RadAgents consistently outperforms competitive baselines by a substantial margin, and ablations show the importance of radiologist-like workflows, visual retrieval, and targeted scaling of the Synthesizer.
\end{itemize}
\section{Methodology}
\label{sec:method}

\begin{figure*}[ht]
    \centering
    \includegraphics[width=\linewidth]{fig/radagents_framework.pdf}
    \caption{RadAgents framework. Each ABCDE subagent executes in parallel guided by clinical workflows, lowering latency, preserving isolation to avoid long-context drift, and improving trustworthiness.}
    % {+ one active ReAct agent, and one idle agent; Active sub-agent -- + with predefined workflow}}
    \label{fig:framework}
\end{figure*}

RadAgents is a multi-agent system with seven specialized agents (Figure~\ref{fig:framework}). Five implement the clinical \textbf{ABCDE} review scheme \citep{hodler2019diseases}: \textit{\textbf{A}irway, \textbf{B}reathing, \textbf{C}irculation, \textbf{D}iaphragm}, and \textit{\textbf{E}verything else}. In addition, an \textit{Orchestrator} agent analyzes each query and routes tasks to the appropriate specialists with the required patient context (for example, imaging view and prior studies), and a \textit{Synthesizer} agent integrates their outputs, resolves conflicts, and produces the final output.
% an agent could be activated multiple times but with different context window, which is useful for Comparison.
This design confines context to task-specific compartments, reducing the information each agent must process and simplifying context compression by having each subagent produce an initial summary for downstream synthesis. It also allows parallel execution, lowering latency for long reasoning. 

% For clinically significant CXR findings such as cardiomegaly and pleural effusion, we curate radiologist-like workflows, the predefined templates within RadAgents, to guide tool selection and clinically grounded reasoning (see the demonstration example in Appendix~\ref{apd:demonstration}). For out-of-template queries, the system invokes workflow-free reasoning, preserving flexibility. The design is extensible: new templates (e.g., reasoning or tool-chains) can be added, and some can generalize to tasks of similar scope or category.



\subsection{Tool Set}

To support radiologist-like reasoning, RadAgents integrates a tool set comprising state-of-the-art machine learning models for specific tasks, general-purpose data-processing utilities, and Python modules that perform measurements and calculations on intermediate results. 
% As the capabilities of medical foundation models continue to expand~\citep{}, a single model can increasingly be reused to serve diverse tasks within this ecosystem.

\begin{itemize}[leftmargin=*, nosep]
    \item \textbf{ROI Segmentation.} Region-of-interest (ROI; e.g., anatomical structures and lesions) segmentation plays a central role in medical reasoning, as it provides interpretable visual evidence and often constitutes the first step in diagnostic workflows. For example, cardiothoracic ratio (CTR) calculation requires measuring both thoracic width and cardiac width from segmentation masks.\\
    \textbf{Tool list:} (a) CXAS, an anatomy segmentation model~\citep{seibold2023accurate}, which can segment up to 157 anatomical structures relevant to chest radiography; (b) BiomedParser~\citep{zhao2024biomedparse}, a text-driven medical image parsing model that covers 82 major biomedical object ontologies, such as viral pneumonia.

    \item  \textbf{Phrase Grounding.} Unlike object-level segmentation, phrase grounding aims to localize a finding described by free-text (e.g., ``right lower lobe opacity'') via a bounding box, providing finding-level evidence for generated outputs.\\
    \textbf{Tool list:} MAIRA-2~\citep{bannur2024maira}, selected because it is trained on diverse (public and private) grounded datasets.

    \item \textbf{Measurement and Calculation.} Radiologists routinely assess the size, shape, and geometric relationships of ROIs to refine diagnoses, such as measuring the carinal angle or estimating pleural effusion volume for severity assessment.\\
    \textbf{Tool list:} We implement a suite of reusable Python scripts that perform geometric measurements and numeric calculations given either image inputs (e.g., segmentation masks, keypoints) or structured text. 
    Further details are provided in Appendix~\ref{appx:skillset-anatomy}.

    \item \textbf{Visual Question Answering (VQA).} Medical VQA models enable the agentic system to handle flexible free-form queries, especially when it is unnecessary or overly costly to execute full visual reasoning pipelines.\\
    \textbf{Tool list:} We adopt MedGemma~\citep{sellergren2025medgemma}, which combines strong instruction-following ability with medical knowledge. As a specialist CXR model, CheXagent is added as a complementary tool.

    \item \textbf{Report Generation.} Report generation models serve as references to initialize or supplement the final radiology report, particularly for routine findings.\\
    \textbf{Tool list:} We use the CheXpert Plus report generator~\citep{chambon2024chexpert}. MAIRA-2 is reused here to provide grounded visual evidence that can be integrated into the generated report.

    \item \textbf{Pathology Classification.} For certain pathology-specific pixel patterns, it is difficult to explicitly quantify the features needed for reasoning, and classification models become particularly valuable.\\
    \textbf{Tool list:} (a) A DenseNet-121 model from TorchXRayVision~\citep{cohen2022torchxrayvision}, trained on four large-scale CXR datasets; (b) the VQA models, which can also be used in a classification mode by constraining their outputs.

    \item \textbf{Data Processing.} General data-processing utilities include a DICOM loader (with metadata parsing for fine-grained measurement), visualization tools, and basic preprocessing operations such as contrast adjustment and resizing, which standardize inputs for downstream tools.
\end{itemize}
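
\noindent As an illustrative sketch of the measurement step (the notation below is introduced here for exposition and is an assumption, not a specification of the implementation), let $M_{\mathrm{heart}}$ and $M_{\mathrm{thorax}}$ denote binary segmentation masks of the cardiac silhouette and the thoracic cavity, and let $w(M)$ be the horizontal extent of a mask, i.e., the distance between its leftmost and rightmost foreground columns. The cardiothoracic ratio can then be written as
\begin{equation*}
    \mathrm{CTR} = \frac{w(M_{\mathrm{heart}})}{w(M_{\mathrm{thorax}})},
    \qquad
    w(M) = \max_{(r,c)\,:\,M_{r,c}=1} c \;-\; \min_{(r,c)\,:\,M_{r,c}=1} c .
\end{equation*}
Because both widths are measured on the same image, the ratio does not depend on pixel spacing. A decision threshold is applied afterwards; $\mathrm{CTR} > 0.5$ on PA views is a commonly cited convention and is mentioned here only as an example.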


    % + Baichuan M2?
    % + image quality control, and image augmentation/enhancement module
    % + data analysis self-programming ability (code generation), radiomics



% \noindent \textbf{Tools.} We employ a suite of models as tools for distinct CXR tasks: CheXagent  for VQA, MAIRA-2 \citep{bannur2024maira} for grounding, the CheXpert Plus report generator \citep{chambon2024chexpert}, and . In addition, we include unique programming tools that return zoomed-in quarter patches or serve for measurement and calculation purposes.


\subsection{Task-aware Subagents}

Each subagent, also called an ABCDE agent, has a defined purpose and domain of expertise. Each is governed by a custom system prompt and maintains its own context window. Their scopes and objectives are:

\noindent \textbf{{A}irway agent:} Systematically assess the central thorax for airway patency, alignment, and paratracheal lesions; for example, determine tracheal position (midline versus deviation).

\noindent \textbf{{B}reathing agent:} Survey the lungs and pleura for parenchymal and pleural pathology; for example, detect opacities (atelectasis, infiltrate) and nodules.

\noindent \textbf{{C}irculation agent:} Evaluate the cardiac silhouette, mediastinum, and vessels; for example, compute the cardiothoracic ratio.

\noindent \textbf{{D}iaphragm agent:} Assess diaphragmatic integrity and look for subdiaphragmatic air; for example, compare right and left diaphragm height.

\noindent \textbf{{E}verything-else agent:} Assess the chest wall (ribs and fractures), soft tissues, and foreign materials such as medical devices.

Each subagent is equipped with an LLM and an individual \textbf{skill set} encoded in its system prompt. The skill set is a collection of reusable skill units, each defined as a reference tool chain together with decision thresholds for a specific clinical purpose. For example, computing the carinal angle requires first obtaining a segmentation mask of the tracheal bifurcation, then applying a geometric algorithm to identify the carina and main bronchi, and finally comparing the resulting angle against the normal range of 40--80 degrees. 
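
\noindent The final step of this skill unit can be sketched as follows (the symbols are illustrative assumptions; the exact geometric algorithm may differ). Let $p_c$ be the detected carina and let $p_r$ and $p_l$ be points on the centerlines of the right and left main bronchi. With $u = p_r - p_c$ and $v = p_l - p_c$, the carinal angle is
\begin{equation*}
    \theta = \arccos\!\left( \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \right),
\end{equation*}
and the measurement is reported as within the normal range when $40^\circ \le \theta \le 80^\circ$.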

The operational plan, i.e., which skills to invoke and in what order, is determined by the high-level intent distributed by the Orchestrator. Conditioned on this intent, each subagent performs step-by-step reasoning and tool use following the ReAct (``Reason + Act'') paradigm~\citep{yao2023react}; if a step fails or produces inconsistent evidence, the subagent triggers local re-planning to repair the failure. This behavior is agentic rather than a fixed automation script, making the system more robust to uncertainty while still leveraging established clinical practice patterns. Details of 
%the prompt design and 
the skill sets for each subagent are provided in Appendices~\ref{appx:skillset-anatomy} and~\ref{appx:skillset-pathology}.


% if do not meet the skill set, react, and if the task is sucessful, save the new skill unit.
% (only when the skillset is large, e.g., > 30 items) Skillset could be taken as the external database instead of directly injected into the system prompt to maintain extensibility and reduce token cost.

\subsection{Global Controller Module}
The global controller comprises the \textit{Orchestrator} and the \textit{Synthesizer}. The Orchestrator selects subagents and allocates tasks with appropriate patient context, and the Synthesizer integrates their outputs, verifies consistency, and resolves errors and conflicts. The major components are detailed below.

\noindent \textbf{Query analysis.} Given a query, the \textit{Orchestrator} first analyzes the intent, extracts key clinical entities and objectives, and then drafts a high-level plan. It activates only the associated subagents (all other agents remain idle, incurring no additional computation cost) and sends them clear, goal-oriented instructions that specify the required clinical indicators without prescribing which tools to use. For example, for the query \textit{``Is there lung opacity?''}, the Orchestrator generates a plan derived from our predefined \textbf{Workflow}, such as: \textit{``Goal: (1) determine the existence of lung opacity; (2) if present, determine the type; (3) determine the location; and (4) verify the answer.''}

This decoupling between the Orchestrator and the tool set greatly improves maintainability: otherwise, adding a new tool or updating an existing one would require rewriting the workflow. Instead, we introduce a skill layer as an intermediate abstraction and let each subagent, guided by language understanding, dynamically compose tools to accomplish a given skill. In this way, workflows, skills, and tools can be maintained and evolved independently. If a task is dispatched incorrectly, the receiving subagent raises a \texttt{SkillMismatchError} to request re-dispatch. 
% Details of the workflow design are provided in Appendix~\ref{appx:workflow_design}.


% \underline{ReAct} when no workflow is specified \citep{yao2023react}, or \underline{Plan-and-Execute (P\&E)} \citep{wang2023lim} when a workflow template is available. This keeps the system language driven and adaptable across queries.

% \noindent \textbf{Context verifier.} (does not work for current small mllm) No tool is perfect, as their capabilities are constrained by model size and training data. 
% When uncertainty arises (for medical LLM, self-consistency: output several times and majority vote may be most stable but it is costly), we trigger a verification step in which an advanced multimodal LLM serves as a judge \citep{chen2024mllm}, filtering out incorrect outputs such as erroneous masks.

% confidence score -- joint log-likelihood probability over tokens for MLLM.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.7\linewidth]{fig/vrag.png}
    \caption{%
    {V-RAG Mechanism. When tool outputs conflict, the system retrieves top-$k$ clinically similar CXR studies as reference standards to verify the findings and resolve the disagreement (e.g., Edema and Cardiomegaly).}}
    \label{fig:vrag}
\end{figure}

\noindent \textbf{Retrieval-augmented conflict resolution.}
No tool is perfect, as its capabilities are bounded by model size and training data, and different tools may produce conflicting outputs. On the \textit{Synthesizer} side, we therefore apply Visual Retrieval-Augmented Generation (V-RAG)~\citep{chu2025reducing}: the Synthesizer retrieves clinically similar chest radiographs (using image embeddings from Rad-DINO~\citep{perez2025exploring}) together with associated context such as patient notes, and leverages these exemplars to adjudicate discrepancies among tools (Figure~\ref{fig:vrag} and Appendix~\ref{apd:vrag}). This design mirrors routine radiologic practice, where clinicians consult prior cases and reference material to calibrate their interpretations.
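
\noindent The retrieval step can be sketched as follows. Cosine similarity is assumed here for illustration, since the similarity measure is not fixed by the description above. Let $f(\cdot)$ denote the Rad-DINO image encoder, $x_q$ the query radiograph, and $\{(x_i, c_i)\}_{i=1}^{N}$ the reference studies with their associated context $c_i$. Each reference is scored by
\begin{equation*}
    s_i = \frac{f(x_q)^{\top} f(x_i)}{\lVert f(x_q) \rVert \, \lVert f(x_i) \rVert},
\end{equation*}
and the $k$ studies with the largest $s_i$, together with their context $c_i$, are passed to the Synthesizer as exemplars.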

\noindent \textbf{Short-term memory.}
RadAgents maintains a shared short-term memory that caches patient-specific context, including demographics, clinical indications, acquisition information (e.g., AP vs.\ PA views), and metadata when DICOM images are provided. In addition, the short-term memory stores tool outputs, which are accessible to all agents. If an agent needs to call a tool whose result is already cached, it directly reads the cached output instead of re-invoking the tool. This mechanism prevents redundant computation, reduces latency, and improves consistency in multi-step analyses that repeatedly reference the same intermediate results.
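
\noindent The caching behaviour can be summarized as follows (a simplified sketch; keying the cache $\mathcal{C}$ by the tool $t$ and its input $x$ is an assumption made for illustration):
\begin{equation*}
    \mathrm{call}(t, x) =
    \begin{cases}
        \mathcal{C}[(t, x)], & \text{if } (t, x) \in \mathcal{C},\\
        t(x), \text{ with } \mathcal{C}[(t, x)] \leftarrow t(x), & \text{otherwise.}
    \end{cases}
\end{equation*}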

% We employ approximate kNN with the Hierarchical Navigable Small World (HNSW) algorithm \citep{malkov2018efficient}, enabling retrieval of the top-$k$ most similar images in $\mathcal{M}$.

\section{Experiments}
\label{sec:exp}

\begin{figure*}
    \centering
    \includegraphics[width=1.0\linewidth]{fig/chestagentbench_barplot.pdf}
    \caption{Performance on ChestAgentBench across different categories of questions.} 
    \label{fig:chestagentbench_results}
\end{figure*}

\subsection{Experimental Setup}
\noindent \textbf{Datasets.}
To demonstrate the generality of \textbf{RadAgents}, we evaluate it on three benchmark datasets that closely mirror complex clinical workflows, offer sufficient task diversity, and are relatively easy to evaluate for correctness:
(1) ChestAgentBench \citep{fallahpourmedrax} includes 2{,}500 questions derived from expert-validated clinical cases, covering seven core competencies and associated reasoning skills essential for CXR interpretation.
(2) For the CheXbench \citep{chen2024chexagent} subset, following MedRAX, we focus on visual question answering (115 cases from Rad-Restruct \citep{pellegrini2023rad} and 123 cases from SLAKE \citep{liu2021slake}) and 380 fine-grained multimodal reasoning questions from OpenI\footnote{\url{https://openi.nlm.nih.gov/}}.
(3) We further evaluate on the preprocessed multi-view and longitudinal MIMIC-CXR test set (2{,}231 cases) used in EditGRPO \citep{zhang2025editgrpo}.

\noindent \textbf{Baselines.}
Unless otherwise specified, we instantiate all agents with Qwen3-VL-Instruct-8B.
Additional results using GPT-4o as the agent core, with the same tool set as MedRAX, are provided in Appendix~\ref{appx:radagents_gpt-4o}.
For comparison, we include:
(1) the specialist model CheXagent and the general-purpose GPT-4o;
(2) Qwen3-VL in a single-agent setting using ReAct (following MedRAX) without workflow steering;
(3) Qwen3-VL (single agent) using ReAct guided by our workflow templates (including the full skill set).
Unless otherwise noted, the number of retrieved exemplars for V-RAG is set to $k=3$ (see Figure~\ref{fig:rag_ablation} and Appendix~\ref{apd:sens_k_vrag} for the ablation study).
We report results for two variants of RadAgents, with and without V-RAG.

\noindent \textbf{Metrics.}
For ChestAgentBench and CheXbench, which consist of closed-ended questions, we report accuracy.
For report generation, we use the GREEN score \citep{ostmeier2024green}, which follows an LLM-as-a-judge paradigm and has been shown to align well with human judgments.
Because the outputs are free-text sentences, this metric captures both clinical correctness and consistency. 

% we use standard CXR text metrics (explanation in Appendix~\ref{apd:metrics}): RadGraph F1 \citep{jain1radgraph}, CheXbert macro F1 across 14 labels \citep{smit2020combining}, RaTE \citep{zhao2024ratescore}, and GREEN \citep{ostmeier2024green}. Because the outputs are sentences, these metrics capture clinical correctness and consistency.

\subsection{Main Results}

\begin{table}
    \centering
    % \small
    \caption{Accuracy (\%) comparison on CheXbench.}
    \label{tab:radagents-chexbench}
    % \resizebox{0.85\linewidth}{!}{ 
    \begin{tabular}{lccc|c}
        \toprule
        \multirow{2}{*}{\textbf{Model}} & \multicolumn{2}{c}{\textbf{VQA}} & \multirow{2}{*}{\textbf{OpenI Reasoning}} & \multirow{2}{*}{\textbf{Overall}} \\
        \cmidrule(lr){2-3}
        & \textbf{Rad-Restruct} & \textbf{SLAKE} & & \\
        \midrule
        CheXagent & 57.1 & 78.1 & 59.0 & 64.7 \\
        GPT-4o & 53.9 & 85.4 & 51.1 & 63.5 \\
        Qwen3-VL w/ ReAct & 70.4 & 86.2 & 61.3 & 68.0 \\
        Qwen3-VL w/ Workflow & 72.2 & 87.8 & 65.3 & 71.0 \\
        RadAgents w/o V-RAG & 71.3 & 87.8 & 66.3 & 71.5 \\
        RadAgents & 76.5 & 89.4 & 69.2 & 74.6 \\
        \bottomrule
    \end{tabular}
    % }
\end{table}

\noindent \textbf{ChestAgentBench.}
Figure~\ref{fig:chestagentbench_results} compares the performance of different systems on the seven categories in ChestAgentBench. RadAgents achieves the best accuracy in every category, yielding an overall score of 73.6\%, substantially higher than CheXagent (39.5\%), GPT-4o (56.4\%), Qwen3-VL w/ ReAct (61.3\%), Qwen3-VL w/ Workflow (63.5\%), and RadAgents without V-RAG (66.9\%). The gains are consistent across tasks, with RadAgents outperforming the strongest single-agent baseline (Qwen3-VL w/ Workflow) by $7$--$10$ points on most categories. The largest margins are observed on diagnosis (73.9\% vs.\ 64.3\%) and characterization (71.0\% vs.\ 61.5\%), which require synthesizing subtle imaging findings and clinical priors. Comparing RadAgents with and without V-RAG also reveals a clear benefit from visual retrieval: V-RAG contributes around 6--7 absolute points overall, suggesting that access to external image evidence is particularly helpful for fine-grained and high-level reasoning questions.

\noindent \textbf{CheXbench.}
Table~\ref{tab:radagents-chexbench} reports results on CheXbench, which includes two VQA benchmarks (Rad-Restruct and SLAKE) and the OpenI image-text reasoning task. RadAgents again attains the highest overall accuracy (74.6\%), improving upon both the domain-specific CheXagent (64.7\%) and the general-purpose GPT-4o (63.5\%). On visual QA, RadAgents reaches 76.5\% on Rad-Restruct and 89.4\% on SLAKE, indicating strong capability in localized and fine-grained visual understanding. Qwen3-VL w/ Workflow and RadAgents w/o V-RAG are competitive, but RadAgents still leads both on all three sub-tasks, by 1.6 to 5.2 points. The OpenI reasoning task is more challenging for all models, yet RadAgents achieves 69.2\% accuracy, outperforming RadAgents w/o V-RAG (66.3\%) and other baselines. These results highlight that the proposed agentic workflow, together with visual retrieval, not only enhances structured VQA but also benefits more global image--text reasoning.

\noindent \textbf{MIMIC-CXR Report Generation.}
Figure~\ref{fig:radagents-mimic-cxr} reports GREEN scores on the MIMIC-CXR test set under the multi-view and longitudinal setting, which better reflects real clinical workflows. RadAgents attains the highest GREEN score of 51.4, outperforming RadAgents w/o V-RAG (46.1), GPT-4o w/ ReAct (42.3), GPT-4o w/ Workflow (41.7), plain GPT-4o (34.2), and CheXagent (23.6). The relatively marginal improvement of the workflow-based variants over plain GPT-4o suggests that simply feeding long multi-study contexts to a single agent is insufficient, as the model can become ``lost in the middle'' \citep{liu2024lost} and under-utilize information dispersed across the sequence. The sizeable gap between RadAgents and its ablated variant highlights that visual retrieval and our specialized multi-agent design are particularly beneficial for generating temporally consistent reports conditioned on multiple studies.

\begin{figure}[ht]
    \centering
    \includegraphics[width=0.65\linewidth]{fig/radagents_rrg_barplot.pdf}
    \caption{{Multi-view and longitudinal performance on MIMIC-CXR test set.}}
    \label{fig:radagents-mimic-cxr}
\end{figure}
\subsection{Ablation Study}
We investigate how the capability of the core LLM (i.e., model scale/parameters) influences RadAgents by varying the backbone Qwen3-VL-Instruct model from 4B to 8B and 30B. Since RadAgents contains multiple roles, in each ablation we upgrade only a single component, either the Orchestrator or the Synthesizer, while keeping all other agents at the default 8B model\footnote{Running all five subagents with 30B models in parallel would be prohibitively expensive.}. For example, when evaluating the Synthesizer, the Orchestrator and all subagents use the 8B model, whereas the Synthesizer uses the 30B model. This setup allows us to separately assess (a) whether a stronger Orchestrator can better orchestrate subagents (i.e., activate the correct specialists and deliver precise workflows), and (b) whether a stronger Synthesizer can more effectively resolve conflicts between tools and subagents.

The evaluation set comprises 100 representative cases: (a) 50 VQA instances randomly sampled from the MS-CXR test set, querying the existence and attributes of abnormalities (e.g., size and severity), and (b) 50 report-generation cases from the MIMIC-CXR test set, covering medium to high complexity. Details of the underlying datasets are provided in Appendix~\ref{appx:radagents_gpt-4o}.

Figure~\ref{fig:ablation} shows that the dispatch success rate, defined as the proportion of cases where the correct subagents are activated and no ``request re-dispatch'' error is raised, increases from 87.0\% to 93.0\% as the Orchestrator model size grows from 4B to 30B. This indicates that stronger language understanding helps distribute tasks more reliably, though the gain is relatively modest because our hybrid search design already resolves most dispatch ambiguities. In contrast, the conflict-resolution rate of the Synthesizer (measured over the 38 cases with inter-tool disagreements, 9 from VQA and the remainder from report generation) improves substantially, rising from 26.3\% (4B) to 44.7\% (8B) and 60.5\% (30B). These results suggest that conflict resolution is considerably more sensitive to model capacity than task dispatch, and that investing capacity in the Synthesizer is crucial for reliably reconciling heterogeneous tool and agent outputs.
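
\noindent As a sanity check on these rates (the counts below are inferred from the reported percentages and denominators, not read from separate logs), the conflict-resolution rates over the 38 disagreement cases are consistent with
\begin{equation*}
    \frac{10}{38} \approx 26.3\%, \qquad \frac{17}{38} \approx 44.7\%, \qquad \frac{23}{38} \approx 60.5\%,
\end{equation*}
i.e., 10, 17, and 23 resolved conflicts for the 4B, 8B, and 30B Synthesizers, respectively. Likewise, dispatch success rates of 87.0\% and 93.0\% over the 100 evaluation cases correspond to 87 and 93 correctly dispatched cases.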

\begin{figure}[t]
    \centering
    \begin{minipage}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{fig/dispatch_conflict_vs_scale.pdf}
        \caption{{Effect of model scale on Orchestrator dispatch success and Synthesizer conflict-resolution rates.}}
        \label{fig:ablation}
    \end{minipage}
    \hfill
    \begin{minipage}[t]{0.48\linewidth}
        \centering
        \includegraphics[width=\linewidth]{fig/retrieval_k_rates.pdf}
        \caption{{Pilot study for $k$ selection in V-RAG, showing that larger $k$ increases the helpful retrieval rate but also raises the harmful rate.}}
        \label{fig:rag_ablation}
    \end{minipage}
    
\end{figure}


\section{{Discussion}}
RadAgents leverages radiologist-inspired workflows, tool-augmented reasoning, and visual retrieval to achieve robust CXR interpretation across three complex benchmarks. Despite these advancements, the current framework has limitations. First, performance is intrinsically upper-bounded by the capabilities of the underlying tools; specifically, the current reliance on tools optimized for frontal-view X-rays limits robustness on lateral views. Second, the interaction between the Orchestrator and tools is unidirectional: while the Synthesizer can resolve conflicts via evidence weighting, it cannot iteratively guide or correct upstream tool outputs (e.g., refining an imperfect segmentation mask). Third, the multi-agent architecture incurs higher computational costs than end-to-end baselines. Future work will address these challenges by optimizing efficiency through dynamic agent selection and extending the framework to 3D modalities (e.g., CT and MRI) via modality-specific subagents, alongside prospective studies to validate clinical impact.







\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
\midlacknowledgments{We would like to thank other members of Oracle Health AI for their support while developing our system, and Raefer Gabriel, Sri Gadde, Mark Johnson, Devashish Khatwani, Yuan-Fang Li, Anit Sahu, Praphul Singh, and Vishal Vishnoi for insightful feedback and discussions.}

\bibliography{midl26_327}

\input{appendix}

\end{document}
