\section{Method}
\label{sec:method}

The proposed method automates the management of incidental findings on abdominal CT scans across multiple abdominal organs, starting from \texttt{PDF}s of medical guidelines. The entire process runs end-to-end automatically using our planner-executor framework. The framework uses the parsed guidelines (stored in a \texttt{JSON} file) and the available protocols to generate and execute the code needed for inspection. An overview of the full pipeline is shown in~\figureref{fig:overall_pipeline}.


\input{figures/pipeline}

\subsection{Parsing Guidelines}
\vspace{-0.1cm}

\label{sec:parsing_guidelines}
We begin by parsing the medical guidelines, which often come in \texttt{PDF} format, into decision trees whose checks and detections lead to recommendations. For this parsing stage, we used an LLM (GPT-4o~\cite{openai2023gpt4o}) together with the LangChain framework~\cite{Chase_LangChain_2022} to analyze figures, tables, cross-references, footnotes, and \texttt{PDF} text, converting them into a \texttt{JSON} format suitable for the later stages. An example of a \texttt{PDF} and the parsed tree is shown in~\figureref{fig:parsed_trees}.
%
\input{figures/parsed_trees}
%
The parsed \texttt{JSON} file contains structured information extracted from the guidelines, including checks, detections, measurements, and recommendations. This structured format allows for easy integration into the planner-executor framework.
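
As a rough illustration of the kind of structure such a file might hold, the sketch below encodes a toy decision tree in \texttt{JSON} and walks it to a recommendation. The field names (\texttt{check}, \texttt{branches}, \texttt{recommendation}), the measurement keys, and all thresholds are illustrative assumptions; they are neither the schema produced by our parser nor values taken from any guideline.

```python
import json

# Hypothetical parsed tree: field names and thresholds are placeholders,
# not the actual schema or clinical values from any guideline.
TREE_JSON = """
{
  "check": {"measurement": "diameter_mm", "op": "<", "value": 10.0},
  "branches": {
    "true": {"recommendation": "placeholder: no follow-up"},
    "false": {
      "check": {"measurement": "mean_hu", "op": "<", "value": 20.0},
      "branches": {
        "true": {"recommendation": "placeholder: likely benign"},
        "false": {"recommendation": "placeholder: further imaging"}
      }
    }
  }
}
"""

# Comparison operators the toy tree is allowed to use.
OPS = {"<": lambda a, b: a < b, ">=": lambda a, b: a >= b}


def evaluate(node, measurements):
    """Walk the tree until a leaf carrying a recommendation is reached."""
    while "recommendation" not in node:
        check = node["check"]
        observed = measurements[check["measurement"]]
        outcome = OPS[check["op"]](observed, check["value"])
        node = node["branches"]["true" if outcome else "false"]
    return node["recommendation"]


tree = json.loads(TREE_JSON)
```

In the actual pipeline the tree is not interpreted directly in this way; it is handed to the planner, which turns it into a Python script.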


\subsection{Planner-Executor Framework}
Using the parsed guidelines (stored in a \texttt{JSON} file) and available protocols, we implemented a planner-executor framework:
\begin{itemize}
    \item \textit{Planner}: The planner, a ReAct~\cite{Yao2022ReActSR} agent built on Claude 3.5~\cite{anthropic2024claude} (selected for its strong code generation capabilities), uses the parsed guidelines to generate a Python script composed of calls to a set of predefined base functions.
    \item \textit{Executor}: The executor runs the generated Python script, triggering the inner base functions.
\end{itemize}
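
The division of labor between the two components can be sketched as follows. This is a minimal illustration under our own assumptions, not the framework's implementation: the generated script is a hard-coded string standing in for planner output, \texttt{measure\_diameter\_mm} is a hypothetical stub, and the threshold and recommendation strings are placeholders. Note that \texttt{exec} offers no sandboxing.

```python
# Hypothetical stub standing in for a real base function.
def measure_diameter_mm(lesion_id):
    return {"lesion_a": 7.5}[lesion_id]


# Stand-in for planner output; in practice this text would come from the LLM.
GENERATED_SCRIPT = '''
diameter = measure_diameter_mm(lesion_id)
if diameter < threshold_mm:
    recommendation = "placeholder: small finding"
else:
    recommendation = "placeholder: large finding"
'''


def execute(script, base_functions, inputs):
    """Run a generated script with the base functions and inputs in scope.

    The script is expected to assign a variable named `recommendation`.
    """
    namespace = dict(base_functions)
    namespace.update(inputs)
    exec(compile(script, "<generated>", "exec"), namespace)
    return namespace["recommendation"]


result = execute(
    GENERATED_SCRIPT,
    base_functions={"measure_diameter_mm": measure_diameter_mm},
    inputs={"lesion_id": "lesion_a", "threshold_mm": 10.0},
)
```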

At the core of this framework is the generation of a Python script that inspects incidental findings according to medical guidelines. The challenge lies in the complex structure of the decision trees from Section~\ref{sec:parsing_guidelines} and the variety of visual subroutines involved in this inspection. For instance, a single program might include the detection of a tumor mask, the calculation of its diameter (in \texttt{mm}), the measurement of its border thickness, the evaluation of its gray level (in Hounsfield units), and the assessment of higher-level attributes by a CLIP classifier, all in addition to the control flow of the Python script itself. These requirements demand extending existing plan-and-execute methods to more sophisticated programs with an expanded set of base functions.

A representative example of a synthesized program derived from the ACR liver guidelines is shown in Algorithm 1 (Appendix A), illustrating the type of clinical logic produced by our planner-executor framework.

\subsubsection{Base Functions.}
The base functions are built on existing methods, models, and detectors for organ segmentation and detection in CT, such as abdominal CT segmentation models, abdominal CLIP models, and image processing procedures. These functions include:
\begin{itemize}
    \item \textit{Organ Segmentation}: Segmenting organs in the CT scan. Based on the TotalSegmentator~\cite{wasserthal2023totalsegmentator} and nnU-Net~\cite{isensee2021nnu} frameworks.
    These provide multiple segmentation models that cover a wide range of tasks, including organ and tumor segmentation in the abdomen.
    %
    \item \textit{Mass and Tumor Segmentation}: Detecting and segmenting masses and tumors. Also based on~\cite{isensee2021nnu,wasserthal2023totalsegmentator}.
    %
    \item \textit{Measuring tumor diameter}: An image processing procedure that measures the diameter (in \texttt{cm} or \texttt{mm}) of a tumor from its voxel mask and the resolution metadata of the CT scan. Several estimation methods are available.
    %
    \item \textit{Measuring gray-level intensity}: An image processing procedure that measures the gray-level intensity (\texttt{HU}) of a tumor from its voxel mask and the metadata in the CT scan file.
    %
    \item \textit{Measuring border thickness}: Measuring the thickness of organ or lesion borders using the Hausdorff distance.
    %
    \item \textit{Labeler}: A labeler module automates the classification of higher-level, fine-grained attributes using a vision-language model. For example, the labeler can tag a lesion as ``benign'', ``suspicious'', or ``flash-filling'' according to a list of sub-features. Our labeler is implemented using the MERLIN model~\cite{blankemeier2024merlin}, a state-of-the-art 3D vision-language model for abdominal CT. It was trained on paired 3D CT volumes and corresponding text reports, enabling it to generate labels for segmented regions on these scans.
\end{itemize}
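
To make the measurement functions more concrete, the following sketch shows one simple way each could be computed. It is an illustration under stated assumptions rather than our implementation: the function names are hypothetical, the diameter is taken as the largest distance between voxel centers (one of several possible estimators, and quadratic in the number of voxels), intensities are converted with the standard DICOM linear rescale $\mathrm{HU} = \text{stored value} \times \text{slope} + \text{intercept}$, and applying the Hausdorff distance between an inner and an outer contour is our assumption about how border thickness might be estimated.

```python
import itertools
import math


def to_mm(voxels, spacing_mm):
    """Convert (z, y, x) voxel indices to physical coordinates in mm."""
    return [tuple(i * s for i, s in zip(v, spacing_mm)) for v in voxels]


def max_diameter_mm(mask_voxels, spacing_mm):
    """Largest distance between any two voxel centers of the mask, in mm."""
    points = to_mm(mask_voxels, spacing_mm)
    if len(points) < 2:
        return 0.0
    return max(math.dist(p, q) for p, q in itertools.combinations(points, 2))


def mean_hu(stored_values, slope, intercept):
    """Mean Hounsfield value of the stored pixel values inside a mask."""
    hu = [v * slope + intercept for v in stored_values]
    return sum(hu) / len(hu)


def hausdorff_mm(points_a, points_b):
    """Symmetric Hausdorff distance between two point sets given in mm."""
    def directed(src, dst):
        # Farthest that any source point lies from its nearest target point.
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(points_a, points_b), directed(points_b, points_a))


# Toy inputs: three voxels in a single slice with anisotropic spacing.
voxels = [(0, 0, 0), (0, 0, 4), (0, 3, 0)]
diameter = max_diameter_mm(voxels, spacing_mm=(5.0, 0.5, 0.5))
```

Because distances are taken between voxel centers, this estimator slightly underestimates the true extent of the lesion; the slope and intercept must be read from the scan metadata rather than assumed.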


\subsubsection{Incidental Findings Code Generation.}
\label{sec:code_generation}

Code generation methods such as~\cite{gupta2023visual,suris2023vipergpt} have shown that programs can successfully describe complex decision-making and analysis processes. However, the clinical detection of incidental findings in abdominal CT scans is a more challenging task. It must account for several critical factors that do not typically arise with natural images, including computing size and gray-level intensity in specified regions, considering scan details such as contrast phase, and often incorporating patient medical history. These complexities call for a robust and adaptable code generation approach to ensure accurate and efficient analysis.

To generate a code representation of each incidental findings management procedure, we provided a detailed description of the API available for each base function, along with simple examples that demonstrate their proper usage. In addition, we included a comprehensive overview of the problem, the clinical pipeline, the parsed tree from Section~\ref{sec:parsing_guidelines}, and other relevant details to effectively instruct the code generation process.
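
One possible way to assemble such an instruction is sketched below. The section titles, the helper names \texttt{describe\_api} and \texttt{build\_prompt}, and the stub base function are all hypothetical and serve only to show how the API description, usage examples, and parsed tree could be combined; the wording of our actual prompt differs.

```python
import inspect
import json


def describe_api(functions):
    """One line per base function: name, signature, and docstring summary."""
    lines = []
    for fn in functions:
        doc = (inspect.getdoc(fn) or "").splitlines()
        summary = doc[0] if doc else ""
        lines.append(f"- {fn.__name__}{inspect.signature(fn)}: {summary}")
    return "\n".join(lines)


def build_prompt(overview, functions, usage_examples, parsed_tree):
    """Concatenate the instruction sections into a single prompt string."""
    sections = [
        ("Task overview", overview),
        ("Available base functions", describe_api(functions)),
        ("Usage examples", "\n".join(usage_examples)),
        ("Parsed guideline tree (JSON)", json.dumps(parsed_tree, indent=2)),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


# Hypothetical base function used only to demonstrate the prompt layout.
def measure_diameter_mm(mask, spacing_mm):
    """Return the lesion diameter in millimetres."""
    raise NotImplementedError


prompt = build_prompt(
    overview="Write a Python script that follows the decision tree.",
    functions=[measure_diameter_mm],
    usage_examples=["d = measure_diameter_mm(mask, spacing_mm)"],
    parsed_tree={"check": "diameter_mm < 10", "recommendation": "placeholder"},
)
```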

The generation process is interactive and proceeds as a multi-turn conversation with the LLM. The steps are as follows:
\begin{enumerate}
    \item \textit{Initial Draft Generation:} The agent generates a preliminary draft of the program from the decision tree.
    \item \textit{Execution and Feedback:} After executing the draft, the system generates feedback and evaluates a STOP criterion to decide whether to continue. This criterion assesses both syntactic correctness and semantic validity of the generated code.
    \item \textit{Iterative Refinement:} If the STOP criterion is not met, another call to the LLM is made to regenerate the code based on the feedback. The method then returns to step 2, iterating through these steps until the final program is produced.
\end{enumerate}
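
The three steps above can be sketched as a short loop. This is a simplified illustration, not our implementation: the LLM is replaced by a stub that returns canned drafts, the syntactic check uses Python's \texttt{ast} module, the semantic check only verifies that the script assigns a \texttt{recommendation} variable, and the cap on the number of turns (\texttt{max\_turns}) is an assumption added to guarantee termination.

```python
import ast


def check_syntax(code):
    """Return None if the code parses, otherwise a feedback message."""
    try:
        ast.parse(code)
        return None
    except SyntaxError as err:
        return f"SyntaxError: {err.msg} (line {err.lineno})"


def refine(generate, validate, max_turns=5):
    """Draft, execute, and regenerate until the STOP criterion is met.

    generate(feedback) returns code; validate(code) returns feedback or None.
    """
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(feedback)
        # STOP criterion: syntactically correct and semantically valid.
        feedback = check_syntax(code) or validate(code)
        if feedback is None:
            return code, turn
    raise RuntimeError("no valid program within max_turns")


# Stub LLM: the first draft is broken, the second one is valid.
drafts = iter(["recommendation = ", "recommendation = 'placeholder'"])


def fake_llm(feedback):
    return next(drafts)


def semantic_check(code):
    """Execute the draft and confirm that it sets `recommendation`."""
    namespace = {}
    exec(code, namespace)
    if "recommendation" in namespace:
        return None
    return "script did not set `recommendation`"


program, turns = refine(fake_llm, semantic_check)
```

Here the first draft fails the syntactic check, its error message is passed back as feedback, and the second draft satisfies the STOP criterion.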








