\section{Introduction}


\label{sec:introduction}

\begin{figure}[htb]
 % Caption and label go in the first argument and the figure contents
 % go in the second argument
\floatconts
  {fig:arch}
  {\caption{Overview of our proposed \frameworkname{}. 
  A VLM selects tools from a pre-specified toolbox of clinically-relevant tools.
  These tools are passed to a Tool Bottleneck Model, which composes/fuses the tool outputs to make a prediction.
  }}
  {\includegraphics[width=\linewidth]
  % {figs/systems_v1.pdf}}
  {figs/fig_1/systems_v11}}
\end{figure}

Human clinical decision-making from medical images typically involves using domain knowledge to analyze or extract multiple clinically-relevant, often spatially-localized features and subsequently integrating them to make a prediction (such as a diagnosis).
% \joy{Can we not use the exact same sentence here as in the abstract?}
For example, in dermatology, pigment networks and border irregularity are important in diagnosing malignant melanoma~\cite{anantha2004detection,stolz1994abcd}. 
As another example, nuclei count and irregularity are known to be correlated with tissue pathology~\cite{kuenen1984prognostic}. 
For any given task, two questions arise: (1) which features are most relevant, and (2) how should these features be integrated to make a prediction?

Broadly, deep learning approaches leveraging general-purpose architectures such as convolutional neural networks (CNNs) and vision transformers (ViTs) address these two questions in a single unified pipeline, taking the raw image as input and training highly parameterized feature extractors and task predictors in an end-to-end fashion. 
These architectures provide the backbone for both image classification models and more recent vision-language models (VLMs)~\cite{wang2022medclip,sellergren2025medgemma}.
% However, these approaches suffer from several drawbacks compared to human clinical decision-making.
However, they typically require large amounts of data to perform well due to their high number of learnable parameters.
Additionally, they may be more challenging to interpret due to their end-to-end, black-box nature.
% there has been a proliferation of AI/deep-learning-based tools, capable of extracting complex and human-interpretable features 
% These include pixel-level features like segmentations, bounding boxes, and contours.
% Similarly, image-level features like regression and classification.
% These tools have primarily been driven by black-box methods such as deep convolutional networks (ResNet, DenseNet, EfficientNet) or ensemble-based machine learning classifiers (e.g. ...) -- they often achieve high accuracy but fail to generalize in medical imaging due to overfitting to dataset-specific, spurious correlations.  

A promising line of work for overcoming the limitations of black-box deep learning is that of \textit{tool-use frameworks}.
% that can be directly examined and intervened upon.
Broadly, tool-use frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple tool calls, which are then composed to make a prediction.
% \footnote{In the literature, when the tool represents a scalar output, these are typically referred to as ``concepts.'' In this work, we view tools as generalizations of concepts.} 
% For example, whether or not a tissue sample is benign or malignant is dependent on factors like nucleus count and shape, mitotic figure count, etc.
Approaches like VisProg~\cite{gupta2023visual} and ViperGPT~\cite{suris2023vipergpt} perform visual understanding by using VLMs to output text (e.g., natural language or code) that composes tools.
These tools are drawn from an extensive toolbox of modules that extract image-level and pixel-level features, such as segmentation or detection.
However, these methods often fail to perform well on medical images, where text-based composition lacks the granularity for fusing fine-grained, spatially-localized features.

To address this, we propose the \frameworkname{} (\frameworkabbr{}), a tool-use framework for medical image understanding that composes VLM-selected and clinically-informed tools with a learned Tool Bottleneck Model (TBM). 
Our framework is illustrated in Figure~\ref{fig:arch}. 
% \joy{Similar to the abstract, I strongly feel that this sentence above should start with ``in contrast'', to differentiate our work from prior tool-use frameworks. }
First, for a given task and image $\bm{x}$, a VLM selects the best tools from a pre-specified toolbox. 
% \ehsan{refer to Fig 1 here.}
This toolbox consists of multiple tools that each extract clinically-relevant features from a given image and modality. 
For example, the toolbox might include tools for histopathology, like nucleus segmentation or a cell typing tool, as well as tools for dermatology, like lesion segmentation or pigment network detection.
Second, our proposed TBM takes as input $\bm{x}$ and the VLM-selected tools and passes $\bm{x}$ through each tool; the tool outputs are then fused by a learned neural network that outputs a prediction.
Inspired by Concept Bottleneck Models~\cite{koh2020concept}, TBMs flow information through a bottleneck layer, but generalize this layer to handle both spatially-localized features at the pixel level and image-level scalar attributes.
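As a rough notational sketch of these two steps (the symbols $\mathcal{T}$, $t_k$, $K$, $S$, and $f_{\theta}$ are illustrative assumptions introduced only for this summary), let $\mathcal{T} = \{t_1, \dots, t_K\}$ denote the toolbox and let $S \subseteq \{1, \dots, K\}$ index the tools that the VLM selects for the given task and image $\bm{x}$.
The TBM prediction can then be written as
\[
  \hat{y} = f_{\theta}\bigl(\{\, t_k(\bm{x}) \,\}_{k \in S}\bigr),
\]
where each tool output $t_k(\bm{x})$ may be either a pixel-level map or an image-level scalar, and $f_{\theta}$ stands for the learned network that fuses the selected tool outputs.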

% Overall, \frameworkabbr{} formulates the medical image understanding task as a tool selection, where the prediction is made by ``execution'' of the tools via a neural network. 
% Each tool processes the same input image to produce outputs (e.g. segmentation maps, bounding boxes, or region probabilities) that are fused (jointly or separately) and then passed to a classifier to output a final prediction. 
% This design allows the model to learn or assign a weight to each tool output, enabling explicit quantification of how much each tool contributes to the decision.
%

% \frameworkabbr{} exhibits several advantages over CNNs, ViTs, VLM-based models or tool-use frameworks.
% First, \frameworkabbr{} inherits the interpretability and grounding benefits of tool use, without sacrificing overall performance. 
% Besides improved interpretability implicit in its architectural design, we further propose methods for quantifying the importance of each tool as well as intervening on tool outputs. 
% Crucially, these interpretability benefits do not sacrifice overall performance.
% Second, we find that \frameworkabbr{} exhibits improved performance in data-limited regimes due to its clinically-relevant inductive biases.
% generalization than black-box baselines and segmentation-based classification methods, especially in data-scarce medical imaging settings.

We find that \frameworkabbr{} not only overcomes limitations of state-of-the-art tool-use frameworks, but also yields predictors that are more interpretable and clinically-grounded.
In particular, ours is the first work to incorporate clinically-grounded priors into tool use via neural-network-based composition.
Our contributions are as follows:
\begin{itemize}[noitemsep, topsep=0pt]
\item We propose the \frameworkname{} (\frameworkabbr{}) for clinically-informed and interpretable tool use for medical image understanding, which uses a medical VLM to select the tools most relevant for the image and task, then composes the selected tools to make a prediction with a learned Tool Bottleneck Model (TBM).
% To facilitate this, we propose tool dropout to enable the TBM to use any subset of tools at inference time.
\item We propose a simple yet effective strategy for TBMs to handle arbitrary tool selections.
\item In tasks derived from histopathology and dermatology, we demonstrate performance on par with or superior to state-of-the-art baselines while being more interpretable, with particular gains in data-limited regimes. % We also propose a notion of tool importance that enables a user to interrogate how important each tool is for making the final prediction.
% \joy{Should we mention interpretability here?}
\end{itemize}
% Through intervention experiments and per-tool decomposition of predictions, we show that TBMs yield inherently interpretable and actionable explanations that go beyond post-hoc interpretability.
% 4) By grounding internal representations in tool semantics, TBMs improves OOD robustness and training efficiency compared to baseline methods, specifically in limited-data settings. 
% 5) The TBM framework is modular and compatible with any collection of high-quality pretrained tools, making it extensible across many domains and modalities, particularly in limited-data medical settings, where clinicians naturally analyze and diagnose images through decomposition into interpretable components
