\section{Experiments}
\label{sec:exps}

\subsection{Multi-Organ Benchmark from Clinical Abdominal CT} 
\label{sec:Multi-Organ Benchmark}
To simulate the real-world management of incidental findings, where multiple organs must be checked against various guidelines (distributed as \texttt{PDF}s) while adhering to hospital protocols, we built a benchmark using American College of Radiology (ACR) guidelines for three organs: liver, pancreas, and kidney. The experiments are based on our internal dataset of abdominal CT scans paired with radiology reports, which we used to develop a procedure for collecting test data for our method. The dataset includes thousands of abdominal CT scans from 6,366 unique patients, acquired in various phases, including the venous and arterial phases.

For liver scans, we restricted our investigation to venous phase CT scans, 
%as this phase provides optimal IV enhancement of the liver, 
allowing existing segmentation models to effectively detect lesions. To obtain a balanced set of recommendations, we included scans both with and without detected liver lesions, which let us cover the different paths in the decision tree. To this end, we also included scans that were performed for liver inspection, where liver lesions are more likely to be found. In total, we gathered 168 liver scans, providing a good balance of the possible decision tree paths.

We applied the same type of filtering for the pancreas and kidney. For the pancreas, we selected venous phase CT scans, while for the kidney, we chose arterial phase CT scans. This ensured that lesions in these organs were detectable by existing segmentation models and allowed us to create a balanced and comprehensive dataset for each organ. In total, the benchmark comprises 168 scans for the liver, 188 scans for the pancreas, and 98 scans for the kidney.

To ensure that our method adhered to established clinical guidelines, we selected guidelines from the ACR website and collected the PDF for each organ. We then parsed these guidelines using the tree-parsing procedure described in Section~\ref{sec:parsing_guidelines}. This process converts the PDF text and figures into a JSON format that holds structured information such as checks, detections, measurements, and recommendations.

\subsubsection{Extracting ``correct'' recommendations from reports.}
\label{sec:synthetic_correct_recommendations}
For each scan, our method aims to generate recommendations based on guidelines for the management of incidental findings in a specific organ. To evaluate the predicted recommendations, we built a procedure that extracts the ``correct'' recommendations from the radiology reports.
%
First, we note that the radiologist's report for each scan includes detailed observations and patient background information, which can be used to infer the recommendation and its explanation as reflected by a trajectory in the parsed tree. Next, we generated a list of all possible paths in the decision tree by traversing it, with each path representing a sequence of checks and decisions leading to a specific recommendation. We then used an LLM (GPT-4o~\cite{openai2023gpt4o}) to review the radiology report and select the tree path that best matches the report.
%The LLM analyzes the report's content and compares it with the possible paths in the decision tree to determine the most accurate recommendation.
This selected path is considered the ``correct'' recommendation for the scan (the leaf holds the recommendation, while the rest of the path can be considered its ``explanation''), and we use it to test both the baseline and our model.
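The path enumeration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the nested-dictionary layout, the keys \texttt{check}, \texttt{branches}, and \texttt{recommendation}, the function \texttt{enumerate\_paths}, and the toy tree contents are all hypothetical and are not taken from any ACR guideline.

```python
from typing import Any

# Hypothetical parsed tree: internal nodes hold a "check" and labelled
# "branches"; leaves hold a "recommendation". Toy contents for illustration only.
TOY_TREE: dict[str, Any] = {
    "check": "lesion detected?",
    "branches": {
        "no": {"recommendation": "no follow-up"},
        "yes": {
            "check": "attribute X present?",
            "branches": {
                "no": {"recommendation": "no follow-up"},
                "yes": {"recommendation": "further imaging"},
            },
        },
    },
}


def enumerate_paths(node, prefix=()):
    """Return every root-to-leaf path as a tuple of steps.

    Each internal step is a (check, answer) pair; the last element of a
    path is the recommendation string stored at the leaf.
    """
    if "recommendation" in node:
        return [prefix + (node["recommendation"],)]
    paths = []
    for answer, child in node["branches"].items():
        step = (node["check"], answer)
        paths.extend(enumerate_paths(child, prefix + (step,)))
    return paths


PATHS = enumerate_paths(TOY_TREE)
for path in PATHS:
    # Everything before the leaf acts as the explanation of the recommendation.
    *explanation, recommendation = path
    print(explanation, "->", recommendation)
```

In such a sketch, the resulting list of paths would be the set of candidate labels from which the LLM selects one per report.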
% An example of the selection and explanation provided by the LLM for a sample report is shown in Fig.~\ref{fig:synthetic_gt}.
% %
% \begin{figure}[h]
%     \centering
%     \includegraphics[width=0.8\textwidth]{figures/gen_gt.png}
%     \caption{The LLM reviews the radiology report and selects the best matching path in the parsed decision tree (all paths are marked as `Labels`), which is then used as the "correct" label for evaluating the models.}
%     \label{fig:synthetic_gt}
% \end{figure}
% %


\subsection{Evaluation of INFORM-CT}
\label{sec:eval_informct}

\subsubsection{Implementation details.}
We used the CLAUDE 3.5 LLM~\cite{anthropic2024claude} to translate the logic described in the guidelines into code. For example, the guidelines might specify that a mass is ``Homogeneous (thin or imperceptible wall, no mural nodule, septa, or calcification).'' This description needs to be converted into code that calls the base functions. The logic of ``or'' and ``not'' is orchestrated by the generated program, while the computation of attributes such as ``thin'', ``thick'', and ``calcification'' is handled by the base functions.
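To make the translation concrete, the quoted guideline phrase could be encoded roughly as below. This is a hedged sketch rather than actual generated code: the \texttt{MassAttributes} container, its field names, and \texttt{is\_homogeneous} are hypothetical stand-ins for the outputs of the base functions.

```python
from dataclasses import dataclass


@dataclass
class MassAttributes:
    """Hypothetical outputs of the base functions for one detected mass."""
    wall: str                 # e.g. "thin", "imperceptible", or "thick"
    has_mural_nodule: bool
    has_septa: bool
    has_calcification: bool


def is_homogeneous(mass: MassAttributes) -> bool:
    """Thin or imperceptible wall, and no mural nodule, septa, or calcification."""
    wall_ok = mass.wall in ("thin", "imperceptible")           # the "or" part
    no_complex_features = not (                                # the "no ..." part
        mass.has_mural_nodule or mass.has_septa or mass.has_calcification
    )
    return wall_ok and no_complex_features


simple = MassAttributes("thin", False, False, False)
septated = MassAttributes("thin", False, True, False)
thick_walled = MassAttributes("thick", False, False, False)
print(is_homogeneous(simple), is_homogeneous(septated), is_homogeneous(thick_walled))
```

The Boolean structure lives in the program, while each attribute value is assumed to come from a separate base function.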
%
We implemented most of the image processing functions using standard Python libraries. To run MERLIN as a labeler for higher-level attributes, we computed the cosine similarity between the embedding of the entire 3D scan and the embeddings of a set of text labels representing all potential attribute values.
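The labeling step can be illustrated with a short sketch. It assumes that the scan and each candidate label have already been embedded into a shared vector space. The toy three-dimensional vectors and the names \texttt{cosine\_similarity} and \texttt{label\_attribute} are assumptions for illustration and do not reflect the MERLIN interface.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def label_attribute(scan_embedding, label_embeddings):
    """Return the label most similar to the scan, plus all similarity scores."""
    scores = {
        label: cosine_similarity(scan_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy vectors standing in for real image and text embeddings.
scan = np.array([0.9, 0.1, 0.0])
labels = {
    "thin wall": np.array([1.0, 0.0, 0.0]),
    "thick wall": np.array([0.0, 1.0, 0.0]),
}
best, scores = label_attribute(scan, labels)
print(best)
```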
%
This implementation was limited by the small number of strong segmentation models available for abdominal organ lesions. All segmentation models were implemented using nnUNET and taken from its GitHub repository~\cite{isensee2021nnu,wasserthal2023totalsegmentator}.

All programs were automatically generated from a PDF containing the guidelines. Our experiments used guidelines from the ACR, but the method can also accommodate adjustments of these guidelines to specific hospital protocols, as well as guidelines from other radiology organizations (e.g., the European Society of Radiology, ESR).

\subsubsection{Baseline Evaluation Using MERLIN Model.} 
Our model was evaluated against the MERLIN baseline~\cite{blankemeier2024merlin}, a vision-language model (VLM) similar to CLIP. To evaluate MERLIN, we enumerated all possible paths in the decision tree and concatenated the text of the nodes along each path. To ensure a fair comparison, we also included the patient background information (such as age and risk factors) that was provided to INFORM-CT. For each decision path combined with the background information, we computed the cosine similarity with the scan and selected the path with the highest score as the prediction of the MERLIN model.
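A rough sketch of this baseline selection rule is given below. It is not the MERLIN pipeline: \texttt{toy\_embed} is a bag-of-words stand-in for a real text encoder, the scan vector is fabricated from a string purely so that the example runs, and all names and toy path texts are hypothetical.

```python
import math
from collections import Counter


def path_to_text(path_nodes, background):
    """Concatenate node texts along one path and append background information."""
    return " ".join(path_nodes) + " " + background


def toy_embed(text):
    """Stand-in encoder: bag-of-words counts (NOT the MERLIN encoder)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_path(scan_vector, paths, background, embed=toy_embed):
    """Score every candidate path text against the scan and return the argmax."""
    texts = [path_to_text(p, background) for p in paths]
    scores = [cosine(scan_vector, embed(t)) for t in texts]
    best_index = max(range(len(paths)), key=scores.__getitem__)
    return best_index, scores


# Toy inputs for illustration only.
paths = [
    ["no lesion detected", "no follow-up"],
    ["lesion detected", "large lesion", "further imaging"],
]
scan_vector = toy_embed("lesion detected large lesion")  # stand-in for a scan embedding
best_index, scores = select_path(scan_vector, paths, background="age 60")
print(best_index, [round(s, 3) for s in scores])
```

The predicted recommendation is then the leaf of the selected path, and the remaining nodes serve as the baseline's explanation.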

\input{tables/multiorgan_findings}
\subsubsection{Analysis.} The recommendation accuracy of our models and of the MERLIN baseline described above is reported in \tableref{tab:multiorgan_classification}.
\figureref{fig:combined_results} illustrates the process for liver and kidney (renal) incidental findings on two sample scans, showing a link to the guideline PDF, excerpts of the generated code, the execution of the code via the base functions, and the execution output.
%
The results indicate that our method can effectively handle the automatic management of incidental findings for different abdominal organs. On this real-world clinical benchmark, the accuracy of the final recommendation predictions is relatively high, and typically much higher than that of a pure VLM approach (``Pure MERLIN'').
%
We also evaluated the correctness of the decisions made along the way, namely the path in the decision tree that yielded the recommendation. This is the explainable part of our model, and this evaluation sheds light on how well the model's explanation matches the correct explanation extracted from the report (as explained in Section~\ref{sec:Multi-Organ Benchmark}). The results for the liver, shown in \tableref{tab:ablation}, indicate that the model's explanation matches the one derived from the report in the majority of cases (54.76\%), while MERLIN provides limited explanatory capability (4.76\%).
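The two accuracy measures can be contrasted with a small sketch. It assumes that a recommendation counts as correct when the predicted leaf equals the reference leaf, and that an explanation counts as correct only when the whole path matches exactly. This exact-match criterion, the function names, and the toy paths are assumptions made for illustration.

```python
def accuracy_percent(predicted, reference, key):
    """Percentage of scans for which key(predicted) equals key(reference)."""
    matches = sum(key(p) == key(r) for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


def recommendation_of(path):
    """The leaf of a path is the recommendation."""
    return path[-1]


def full_path(path):
    """The whole path (explanation plus recommendation)."""
    return tuple(path)


# Toy predicted and reference paths for four scans (illustrative only).
predicted = [("a", "yes", "R1"), ("a", "no", "R2"), ("b", "yes", "R1"), ("a", "no", "R3")]
reference = [("a", "yes", "R1"), ("a", "no", "R2"), ("a", "yes", "R1"), ("a", "no", "R2")]

rec_acc = accuracy_percent(predicted, reference, recommendation_of)
path_acc = accuracy_percent(predicted, reference, full_path)
print(rec_acc, path_acc)
```

Under these definitions the explanation accuracy can never exceed the recommendation accuracy, since a fully matching path also has a matching leaf.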

Finally, we assess the contribution of the internal components of the model to its performance. Specifically, we evaluate the contribution of the segmentation base functions through an ablation study. In Table~\ref{tab:ablation}, we present the recommendation accuracy (in percent) of an ablated INFORM-CT model in which the segmentation tasks are converted to text and performed by MERLIN instead.
Comparing this ablated model (20.45\%) to the full INFORM-CT (63.09\%) reveals that the segmentation component is critical to the success of the model and cannot be replaced by the VLM. However, the VLM and the image processing routines are also crucial components of INFORM-CT, which suggests that the whole is greater than the sum of its parts.



\begin{figure}[t]
    \centering
    % \floatcounts
    \begin{tabular}{cc}
        \includegraphics[width=0.48\textwidth]{figures/result_liver_figure.png} & 
        \includegraphics[width=0.48\textwidth]{figures/result_renal_figure.png} \\
        \textbf{(A)} & \textbf{(B)} \\
    \end{tabular}
    
    \caption{
    Predictions of the INFORM-CT model for sample scans, adhering to ACR guidelines for the management of incidental findings. The process and results are shown for liver (A) and renal (B) findings. The selected tree trajectory is shown as output, and the final recommendation is marked in magenta. %\TODO{change "decision" to "recommendation"}
    }
    \label{fig:combined_results}
\end{figure}




\begin{table}[htb]
% The first argument is the label.
 % The caption goes in the second argument, and the table contents
 % go in the third argument.
\floatconts
% \centering
    {tab:ablation}
    {\caption{
        Additional evaluation of incidental finding management for the liver (accuracy in percent). Left: the explanatory part of the model, measured as the accuracy of the decision trajectory leading to the final recommendation when matched against the reasons provided in the report. Right: a comparison of the full and ablated models, where in the ablated model the segmentation base routine is removed and replaced by the VLM.
    }}
    {\begin{tabular}{|c|c|c|}
    \hline
    \multicolumn{3}{|c|}{\textbf{Explanation Evaluation}} \\ \hline
    \textbf{} & \textbf{INFORM-CT} & \textbf{Pure MERLIN} \\ \hline
    Acc  & 54.76                 & 4.76                  \\ \hline
    \end{tabular}
    \hspace{0.5cm} % Add space between the two tables
    \begin{tabular}{|c|c|c|}
    \hline
    \multicolumn{3}{|c|}{\textbf{Recommendation Evaluation}} \\ \hline
    \textbf{} & \textbf{   Full   } & \textbf{Ablated} \\ \hline
    Acc  & 63.09                 & 20.45                  \\ \hline
    \end{tabular}
    }
\end{table}
% \vspace{-1.3cm}