\section{Introduction}

Incidental findings on abdominal CT scans are common and may have important clinical implications. Therefore, it is crucial to report these findings in an actionable manner, adhering to established guidelines.
%
Virtually every scan reveals incidental findings, making it essential to distinguish clinically significant findings from background noise. This paper addresses the clinical challenge of managing the overwhelming number of findings, especially in older individuals, in whom incidental findings are prevalent.
% 
% Traditional methods, which rely heavily on manual inspection by radiologists, can be time-consuming and subject to variability. 
% % method overview:
% Recent advances in machine learning have introduced opportunities for novel approaches to improve the accuracy and efficiency, for incidental findings reporting. These approaches leverage state-of-the-art large language models (LLMs) and vision-language models (VLMs) to analyze medical images and associated textual data. By integrating these advanced models, it is possible to improve the detection of abnormalities and provide more reliable diagnostic tools. 
We propose a novel agent framework, in a plan-and-execute style, that combines a large language model (LLM) with a vision-language model (VLM) to improve the efficiency and precision of automatic incidental findings analysis in abdominal CT imaging, in adherence to medical guidelines.

% RELATED WORK:
%%%% RELATED WORK on INCIDENTAL FINDINGS
Traditional methods for incidental findings detection on abdominal CT rely on manual inspection by radiologists, which can be time-consuming and prone to variability~\cite{berland2010managing}. In the past decade, deep learning-based medical anomaly detection has emerged as a relevant approach. These methods often aim to learn the distribution of normal patterns from healthy subjects and detect anomalous ones as outliers, for instance, via autoencoders or generative adversarial networks (e.g.,~\cite{zhang2023model,schlegl2019f,akcay2019ganomaly,shvetsova2021anomaly,almeida2023coopd}). 
Other relevant models target segmentation or detection of specific types of incidental findings (e.g., liver masses~\cite{lyu2024superpixel}). However, none of these methods offers a general-purpose approach for detecting multiple incidental findings across various organs, as required in abdominal imaging.
%
%Our work differentiates itself by leveraging state-of-the-art LLMs, VLMs, and computer vision subroutines to propose a general-purpose method for efficient incidental findings management in abdominal CT adhered to rafiguidelines.

%%%% RELATED WORK on VL:
Vision-language multimodal approaches have shown promise in enhancing the detection of pathologies by leveraging both visual and textual information. 
%
CLIP~\cite{radford2021learning} efficiently learns visual concepts from natural language supervision, enabling zero-shot transfer capabilities.
For medical 3D inputs, CT-CLIP~\cite{hamamci2024developing} and BIMCV-R~\cite{chen2024bimcv} focus on chest CT volumes, pairing them with radiology text reports to improve diagnostic accuracy. MERLIN~\cite{blankemeier2024merlin}, designed for abdominal CT, integrates textual and 3D visual data to provide comprehensive insight into abdominal imaging. Collectively, these models advance medical imaging by combining visual and textual information, improving zero-shot classification without additional annotations. However, they still struggle with complex diagnostic tasks and, as we show here, can be significantly augmented when paired with LLMs and computer vision subroutines in an agent-based framework.
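
As a rough illustration of the zero-shot classification that such vision-language models enable, the following minimal sketch scores one image embedding against a set of text-prompt embeddings by cosine similarity, followed by a softmax. It is not the implementation of any of the cited models: the function names, the toy three-dimensional embeddings, the two prompts, and the temperature value are all assumptions made for illustration, and in practice the embeddings would come from the image and text encoders of the respective model.

\begin{verbatim}
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_scores(image_emb, text_embs, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities.

    The temperature is an assumed value chosen for this sketch.
    """
    logits = [cosine(image_emb, t) / temperature for t in text_embs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings standing in for encoder outputs (hypothetical values).
prompts = ["no renal lesion", "renal cyst"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
image_emb = [0.2, 0.9, 0.1]

probs = zero_shot_scores(image_emb, text_embs)
predicted = prompts[probs.index(max(probs))]
print(predicted)
\end{verbatim}

No labelled training examples are needed for the candidate classes in this scheme; changing the set of prompts changes the set of classes.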

%\subsubsection{Related work on Planner-Executor Systems.} 
Planner-executor systems automate complex tasks by generating and executing code based on pre-defined instructions. Recent advances in plan-and-execute frameworks have paved the way for the integration of LLM-powered agents. These agents can plan and perform actions, enhancing the overall efficiency and accuracy of task execution.
%
The majority of computer vision work on such systems focuses on visual question answering (VQA). For example, methods such as~\cite{suris2023vipergpt,khan2024self,gupta2023visual} combine code-generation models with vision-language models such as CLIP, exposed as subroutines, and answer a query by generating and executing Python code.
More advanced methods integrate a planner, a reinforcement learning agent, and a reasoner for reliable reasoning (e.g.,~\cite{ke2024hydra}) or use multi-turn conversation and feedback (e.g.,~\cite{Yao2022ReActSR,min2024morevqa}).
%
In the context of incidental findings detection, such systems can help ensure adherence to clinical protocols and improve the efficiency of the inspection process.
To our knowledge, we are the first to apply this approach of code generation and execution for CT diagnosis, providing a novel and interpretable solution for medical imaging analysis. 
%
% While the VQA work mentioned above is often limited to very few base functions (e.g., object detectors and classifiers like CLIP), as well as relatively short generated programs, this medical imaging challenge requires complex programs and involves base functions for image processing (e.g., size, edge, gray-level, etc.). Therefore, we turn here for a careful design of the plan-and-executor model.

While prior VQA-based approaches typically rely on a small set of base functions (e.g., object detectors, CLIP) and produce short programs, our setting requires substantially more complex programs together with low-level image processing primitives (e.g., size, edge, intensity). This motivates a careful design of the underlying plan-and-execute architecture, tailored to the medical imaging domain.
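
To make the notion of a generated program built on low-level base functions more concrete, the following is a minimal sketch of what such a program could look like. It is purely illustrative and is not the output of our system: the base functions \texttt{lesion\_extent\_mm} and \texttt{mean\_intensity}, the function \texttt{generated\_program}, the returned strings, and both thresholds are hypothetical placeholders that are not taken from any clinical guideline.

\begin{verbatim}
# Placeholder thresholds for illustration only (NOT clinical values).
SIZE_THRESHOLD_MM = 10.0
INTENSITY_THRESHOLD = 20.0


def lesion_extent_mm(mask, spacing_mm):
    """Hypothetical base function: largest axis-aligned extent of a
    binary 2D mask, in millimetres."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return 0.0
    height = (max(rows) - min(rows) + 1) * spacing_mm
    width = (max(cols) - min(cols) + 1) * spacing_mm
    return max(height, width)


def mean_intensity(image, mask):
    """Hypothetical base function: mean image value inside the mask."""
    values = [image[r][c]
              for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    return sum(values) / len(values) if values else 0.0


def generated_program(image, mask, spacing_mm):
    """Example of a rule a planner might emit from guideline text."""
    size = lesion_extent_mm(mask, spacing_mm)
    if size == 0.0:
        return "no finding"
    intensity = mean_intensity(image, mask)
    if size < SIZE_THRESHOLD_MM and intensity < INTENSITY_THRESHOLD:
        return "no follow-up"
    return "follow-up recommended"


# Toy 4x4 slice with a 2x2 lesion and 2 mm pixel spacing.
mask = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
image = [[5.0] * 4 for _ in range(4)]
print(generated_program(image, mask, spacing_mm=2.0))
\end{verbatim}

Because the decision logic is explicit code over measurable quantities, each recommendation can be traced back to the measurements and rule that produced it.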

% contributions:
In summary, our contributions are as follows:

(i) We are the first to propose an incidental findings pipeline for the entire abdominal region, based on an LLM and VLM agentic approach. This pipeline is general, automatically created, and adheres to clinical protocols and guidelines. 

(ii) We propose a \textit{plan-and-execute} program generation method that starts from a guideline document in \texttt{PDF} format and automatically generates and executes a robust Python program with multiple visual subroutines (base functions) that predict clinical recommendations.

(iii) We introduce a benchmark and a new method to create test examples for incidental-finding recommendations, based on abdominal CT reports.

% \todo{additional contribution of the programming based approach is the generalisation ability to the new clinical guidelines. This ability does not exist in papers like VILA-M3, VoxelPrompt or other visual instruction tuning approaches, which struggle to generalise beyond the training-data instructions}

%We also provide a benchmark including incidental findings annotations to evaluate automatic incidental findings reporting methods.

% organization of paper:
% The rest of this paper is organized as follows.
% Sec.~\ref{sec:related_work} reviews existing research in incidental findings detection, vision-language (multimodal) learning in medical imaging, and planner-executor systems. Sec.\ref{sec:method} describes our proposed method, including the parsing of guidelines, the planner-executor framework, and the development of the labeler. Sec.\ref{sec:exps} presents our experimental setup and results, demonstrating the effectiveness of our method on benchmark datasets. Sec.~\ref{sec:discussion} summarizes the findings, discusses the implications of our work, and outlines potential future research directions. 
