\section{Method}
\label{sec:method}


% We are interested in the medical image understanding task, where the input is image... and output is... n classes for Camelyon and m classes for ISIC...

% Here, important desiderata are for models to generalize at test time, be data-efficient, and interpretable... Hence, we propose our framework...

% 3.1 Clinically-informed tool-use framework: our full framework consists of three main components (See Figure)... We assume access to a clinically-informed toolbox... We use a VLM as a tool selector... Then we train a Tool Bottleneck Model... We detail each section below.

% 3.2 Toolbox: our tools are designed to incorporate medical domain knowledge... have pixel-level outputs... 

% 3.3 VLM as Tool Selection Prior: we use a VLM to inform tool selection... then do tool drop out augmentation...

% 3.4 Tool Bottleneck Model: our final predictor is composed of... and trained with... \\

We are interested in the medical image understanding task, where the input is an image $\bm{x}$, and the goal is to predict an image-level label $y$. 
In this work, we consider two clinical prediction settings: tumorous tissue detection in histopathology patches and cancer prediction in dermatology images. 

To bring the interpretability and grounding benefits of tool use to medical image understanding, we propose the \frameworkname{} (\frameworkabbr{}), which is composed of three components:
(1) a toolbox of pretrained medical imaging tools (Section~\ref{sec:toolbox}),
(2) a vision-language model (VLM) that serves as a tool selector (Section~\ref{sec:vlm}), and 
(3) a novel Tool Bottleneck Model (TBM) that makes the final prediction from tool outputs (Section~\ref{sec:tbm}).  

Figure~\ref{fig:arch} provides an overview of this framework, and we describe each component in detail below.
% Across these settings,  important desiderata include (i) robustness to distribution shift,  (ii) data-efficient learning from limited labeled samples, and (iii) interpretability that enables  direct, clinically meaningful interrogation of model reasoning. 


\subsection{Toolbox of Clinically-Relevant Tools}
\label{sec:toolbox}

% A plethora of clinically informed tools have been developed by domain-experts to help inform medical predictions. 
% \joy{Can we add more content for sections 3.1 and 3.2? They're a bit short right now.}
We assume access to a toolbox of $N$ tools, 
$
\mathcal{T} = \{ t_1, \ldots, t_N \},
$
where each tool $t_i$ outputs clinically-relevant features for a given input image. 
For a given image and task of interest, we assume there exists a subset of tools that extracts clinically-relevant information for solving the task; for example, a nucleus segmenter and a mitotic figure detector for tumorous tissue detection, or a lesion segmenter and a brown color detector for skin lesion diagnosis.

The toolbox may consist of tools designed for a variety of modalities.
Furthermore, each tool $t_i$ may produce an output $\bm{z}_i = t_i(\bm{x})$ of varying structure, including a scalar (e.g., lesion count, predicted probability), a vector (e.g., bounding box coordinates), or a spatial map (e.g., segmentation masks, probability maps). Our goal, given this toolbox, is two-fold: to select the relevant tools for a given task and to fuse their outputs to make a prediction. 
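
As one illustrative way to make these output types concrete (the dimension symbol $d$ is introduced here only for illustration, while $C_i$, $H$, and $W$ denote the channel count, height, and width of a spatial map), a tool output may take one of the following forms:
\[
\bm{z}_i \in \mathbb{R} \;\; \text{(scalar)}, \qquad
\bm{z}_i \in \mathbb{R}^{d} \;\; \text{(vector)}, \qquad
\bm{z}_i \in \mathbb{R}^{C_i \times H \times W} \;\; \text{(spatial map)}.
\]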

In our experiments, we build a single toolbox $\mathcal{T}$ of $N=12$ tools that is shared across all tasks.
For histopathology, we use the open-source models from TIAToolbox~\cite{pocock2022tiatoolbox}, a computational pathology toolbox, specifically the HoVer-Net~\cite{graham2019hover} model, which provides nuclear instance segmentation and classification. 
We convert these predictions into five spatial feature maps, each serving as one tool: \textit{bounding boxes}, \textit{centroids}, \textit{nucleus types}, \textit{type probabilities}, and \textit{contours}. 
% Each of these outputs is rasterized into a $96 \times 96$ pixel-feature map aligned with the corresponding image. %, yielding a set of nucleus-level tool channels that encode morphology and spatial organization. 
% In addition, we include a tissue segmentation mask obtained from a HistomicsTK-based tissue detector as a coarse region-of-interest tool, and a mitotic figure detector (DeepMitosis / DeepSeg) as a specialized tool for mitotic activity. 
% All tool outputs are represented in the same spatial grid as the original Camelyon patches, enabling them to be stacked or processed independently within the TBM framework.
For dermatology, we consider three categories of tools: lesion segmentation, dermoscopic structure maps, and color-based markers.
In total, this yields seven tools: \textit{lesion segmenter}, \textit{pigment network}, \textit{negative network}, \textit{streaks}, \textit{milia-like cysts}, \textit{malignant-colored pigment marker}, and \textit{brown pigment marker}. 
% These tool outputs are rasterized into $224 \times 224$ feature maps.
% These are also rasterized as $H \times W$ maps aligned with the dermoscopic image and appended as additional channels in the tool stack. 
% Altogether, the lesion mask, dermoscopic structure maps, and color-based markers provide a diverse set of pixel-level tools that encode clinically relevant morphology and pigmentation patterns for TBM.
We provide additional details on all tools in Appendix~\ref{app:tool_details}.

% A \textbf{toolbox} of $N$ pretrained tools $\mathcal{T} = \{t_1, ..., t_N\}$, consisting of a set of pretrained, clinically meaningful tools, individually-validated ($t_i$). These tools might pertain to different modalities (e.g. histopathology, dermatology), and a subset of them are assumed to be pertinent to the task at hand. 

\subsection{Vision-Language Model as a Tool Selector}
\label{sec:vlm}

% In lieu of any prior knowledge of the optimal tool selection for a given image $\bm{x}$, one can design uninformative priors for tool selection (e.g. Bernoulli sampling per tool).
We formulate tool selection as choosing up to $k$ tools from the toolbox that are most relevant for a given task and image.
% To do this, we propose distilling implicit clinically-informed knowledge from a medical VLM.
We use a medical VLM to perform this tool selection.
Specifically, the VLM is prompted with $k$, the image $\bm{x}$, the toolbox $\mathcal{T}$, and a natural-language description of the task, and it returns a subset of tools $\mathcal{T}_s \subseteq \mathcal{T}$ that it deems most relevant for solving this particular task.
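One way to write this selection step compactly, using a symbol $d$ for the task description that we introduce here only for illustration, and assuming the VLM is asked for at most $k$ tools, is
\[
\mathcal{T}_s = \text{VLM}(\bm{x}, \mathcal{T}, d, k) \subseteq \mathcal{T},
\qquad |\mathcal{T}_s| \le k,
\qquad s_i = \mathbf{1}[\, t_i \in \mathcal{T}_s \,] \;\; \text{for } i = 1, \ldots, N,
\]
where the binary indicator $s_i$ records whether tool $t_i$ was selected for the image $\bm{x}$.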

An example prompt is provided below, and an example VLM-selection output is provided in Figure~\ref{fig:arch}:
\begin{lstlisting}
TOOLBOX = {
  "histo_nuc_centroid":  "Returns each nucleus centroid in 
                         [x_center, y_center]",
  "histo_nuc_bbox":      "Returns each nucleus bounding box in 
                         [x_top_left, y_top_left, width, height]",
  ...

  "derm_lesion_segmenter":         "Segments lesion ROI",
  "derm_pigment_network":          "Detects reticular pigment 
                                    network",
  ...
}    

You are a medical expert in {modality}. Select tools for a single   
task from a fixed toolbox {TOOLBOX}. 
Choices must depend on the task and image evidence.
Choose max {k} tools from the toolbox {TOOLBOX}
that are most relevant for solving the task in each image; no 
duplicates.
\end{lstlisting}

The full prompt is provided in Appendix~\ref{app:prompts}.
In addition to fixed top-$k$ tool selection, we experiment with a dynamic variant and report the results in Appendix~\ref{app:vlm_prompting}.

\subsection{Tool Bottleneck Model (TBM)}
\label{sec:tbm}
We propose the TBM for making predictions from the outputs of VLM-selected tools. It can be seen as an extension of the Concept Bottleneck Model (CBM) to the pixel-level tool outputs common in medical imaging. 
A TBM takes as input $\bm{x}$ and the selected tools $\mathcal{T}_s$, and it outputs the final image-level prediction: $y = \text{TBM}(\bm{x}, \mathcal{T}_s)$.
Like a CBM, it does so in two steps.
First, it computes the tool bottleneck layer $\bm{z}$, which is the set of all tool outputs for the given image $\bm{x}$ across all provided tools.
Then, a neural network $f_\theta$ takes in $\bm{z}$ and outputs the final prediction:
$
y = f_\theta(\bm{z}).
$
During training, all tools are frozen and only the parameters $\theta$ within the TBM are learned.
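As a sketch of what this setup implies, assume a labeled training set $\{(\bm{x}^{(j)}, y^{(j)})\}_{j=1}^{M}$ and a generic classification loss $\ell$ (for instance, cross-entropy); both the symbol $M$ and the choice of $\ell$ are assumptions made here for illustration. The learned parameters can then be written as
\[
\theta^\star = \arg\min_{\theta} \; \frac{1}{M} \sum_{j=1}^{M} \ell\big( f_\theta(\bm{z}^{(j)}),\, y^{(j)} \big),
\qquad
\bm{z}^{(j)} = \big( t_1(\bm{x}^{(j)}), \ldots, t_N(\bm{x}^{(j)}) \big),
\]
where the tools $t_1, \ldots, t_N$ contribute no trainable parameters to the objective.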
% We do not train the tools in order to ensure that each tool remains clinically-relevant and interpretable.

    % The TBM takes the selected tool outputs and fuses them with a CNN feature extractor $f_\theta$ before outputting the final diagnosis $\bm{y}$. To ensure that the network $f_\theta$ can operate under any subset of tools $\mathcal{T}_s$, we implement \textit{tool dropout} augmentation.
% \end{enumerate}

% In the above training setup, the TBM must be trained and tested on all available tools.
% However, not all tools may be relevant for each image.
% In the simplest case, given a toolbox comprising tools for different modalities, the selected tools should be specific to that modality, like selecting nuclei contour segmenter or mitotic figure detector for a given histopathology patch.
% Even within a specific modality, specific tools might be more relevant than others -- for example, certain cancers might benefit more from detecting atypical nuclei than mitotic figures.


We propose a simple yet effective implementation of $f_\theta$ that (1) fuses the pixel-level features across tools and (2) accepts an arbitrary selection of tools $\mathcal{T}_s$.
%
To address (1), we rasterize all $\bm{z}_i$ as pixel-feature maps of size $(C_i \times H \times W)$, ensuring spatial correspondence across tool outputs.
All tool maps are concatenated along the channel dimension to form $\bm{z} = \text{Concat}(\bm{z}_1, \bm{z}_2, \ldots, \bm{z}_N) \in \mathbb{R}^{C \times H \times W}$, where $C = \sum_{i=1}^{N} C_i$.
Then, we implement $f_\theta$ as a CNN feature extractor followed by a final fully-connected classifier.
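As a worked example of the channel arithmetic, consider a hypothetical toolbox of $N=3$ tools with purely illustrative channel counts: a binary mask with $C_1 = 1$, a two-class probability map with $C_2 = 2$, and a four-channel map with $C_3 = 4$. Concatenation then gives
\[
C = C_1 + C_2 + C_3 = 1 + 2 + 4 = 7,
\qquad
\bm{z} \in \mathbb{R}^{7 \times H \times W},
\]
so the first convolutional layer of $f_\theta$ would take $7$ input channels in this example.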

To address (2), we train TBMs with \textit{tool knockout} augmentation~\cite{nguyen2025knockoutsimplewayhandle}.
Specifically, given a tool selection $\mathcal{T}_s$, we replace the outputs of non-selected tools $\mathcal{T} \setminus \mathcal{T}_s$ with a fixed placeholder value $\bm{\bar{z}}_i$ (e.g., a constant map of $-1$s).
% At test time, non-selected tool outputs are replaced with $\bm{\bar{z}}$. 
Prior work has shown that this knockout strategy is equivalent to an implicit multi-task objective that jointly learns estimators of $y$ conditioned on the full tool set and on each of its subsets (see Appendix~\ref{app:tool_knockout}).
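One way to express the knockout input, writing $s_i \in \{0,1\}$ for whether tool $t_i$ belongs to $\mathcal{T}_s$ and using the notation $\bm{\tilde{z}}_i$ introduced here only for illustration, is
\[
\bm{\tilde{z}}_i = s_i \, \bm{z}_i + (1 - s_i) \, \bm{\bar{z}}_i,
\qquad
\bm{\tilde{z}} = \text{Concat}(\bm{\tilde{z}}_1, \ldots, \bm{\tilde{z}}_N) \in \mathbb{R}^{C \times H \times W},
\]
so that $f_\theta$ always receives an input of the same shape regardless of which tools are selected.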
% Thus, the bottleneck is defined by the choice of $M(x)$, which may depend on the image, the VLM-selected tool set, or random dropout. 
% During training, tool dropout is applied so that the model learns to operate reliably under varying tool availability.
%
% different TBM variants correspond to different masking regimes, which we describe in the following section.
Tool knockout also permits a ``leave-one-tool-out'' analysis, which gives the user a measure of tool importance that is useful for model interpretability.
We demonstrate this in Section~\ref{sec:interpretability}.
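As a minimal sketch of such an analysis, suppose importance is measured as the change in the predicted score for the class of interest; this particular score, and the notation $\bm{z}^{\setminus j}$ and $I_j$, are assumptions made here for illustration and may differ from the measure reported later. Let $\bm{z}^{\setminus j}$ denote the bottleneck in which only the output of tool $t_j$ is replaced by its placeholder $\bm{\bar{z}}_j$. Then a per-image importance for tool $t_j$ could be written as
\[
I_j(\bm{x}) = \big| \, f_\theta(\bm{z}) - f_\theta(\bm{z}^{\setminus j}) \, \big| ,
\]
with larger values indicating that the prediction relies more heavily on tool $t_j$.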

In our experiments, we found that randomly perturbing the tool selection during training improved performance.
Specifically, instead of directly passing the selected tools $\mathcal{T}_s$ to the TBM, we decide whether to include each tool $t_i$ by sampling from a Bernoulli distribution with parameter $p_i = (1-\alpha) \cdot 0.5 + \alpha s_i$; here, $s_i \in \{0,1\}$ is a binary selection indicator denoting whether or not the VLM selects tool $t_i$, and $\alpha \in [0, 1]$ is a hyperparameter that controls the strength of the VLM prior.
% increase tool robustness during training, for each image, tool selections are perturbed -- VLM-selected tools are selected with a probability of 0.95, while non-VLM-selected tools are selected with a probability of 0.10.
% The resulting perturbed tool selection is used during training as $\mathcal{T}_s$. 
We ablate this perturbation in our experiments (see also Appendix~\ref{app:tbf_training_ablations}).
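To illustrate the perturbation rule, write the per-tool Bernoulli parameter as $p_i = (1-\alpha) \cdot 0.5 + \alpha s_i$ and consider a few values of $\alpha$ chosen here only as examples. At $\alpha = 0$, every tool is kept with probability $p_i = 0.5$ regardless of the VLM, and at $\alpha = 1$, $p_i = s_i$, which recovers the VLM selection exactly. For an intermediate value such as $\alpha = 0.8$,
\[
p_i = (1 - 0.8) \cdot 0.5 + 0.8 \, s_i = 0.1 + 0.8 \, s_i
\quad \Rightarrow \quad
p_i = 0.9 \;\text{ if } s_i = 1, \qquad p_i = 0.1 \;\text{ if } s_i = 0 .
\]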
% This combines the interpretable and modular nature of TBMs with medical-domain expertise of large VLMs.

% For a given image $\bm{x}$, we define a natural-language task description and a toolbox $\mathcal{T}$, and prompt the VLM to select a subset of tools it deems most relevant for solving the task.
% \begin{equation}
    % \mathcal{T}_s = \text{VLM}(\bm{x}, \mathcal{T}, p) \subset \mathcal{T}
% \end{equation}
% These selections are then converted into per-image priors over tools and used to construct the masking regimes described in Section~\ref{sec:model-variants} (VLM–Exact, VLM–Stochastic, VLM–Global).
% For Camelyon17-WILDS, the toolbox consists of nucleus-level and tissue-level histopathology tools derived from HoverNet and tissue detection:
% \texttt{histo\_nuc\_centroid}, \texttt{histo\_nuc\_bbox}, \texttt{histo\_nuc\_contour}, \texttt{histo\_nuc\_type}, \texttt{histo\_nuc\_type\_prob}, and \texttt{histo\_tissue\_segmentor}. 
% We prompt MedGemma with a short task description and the toolbox, for example:
% \begin{verbatim}
% Task: Determine if the central 32x32 region of this 96x96 histopathological 
% image contains any tumor tissue.
%
% TOOLBOX = {
    % <tool_name>: <tool_description>,
    % ...
% }
% \end{verbatim}
% MedGemma is instructed to select a fixed number of tools from this set. 
% Its selections define a binary mask $\bm{m}$ over tools for each image, and we explore different ways of integrating selected tools with TBM. %which we use to build the VLM–Exact, VLM–Stochastic, and VLM–Global masking regimes for Camelyon17.

% For Camelyon17-WILDS, the toolbox consists of nucleus-level and tissue-level histopathology tools derived from HoverNet and tissue detection:
% \texttt{histo\_nuc\_centroid}, \texttt{histo\_nuc\_bbox}, \texttt{histo\_nuc\_contour}, \texttt{histo\_nuc\_type}, \texttt{histo\_nuc\_type\_prob}, and \texttt{histo\_tissue\_segmentor}. 
% We prompt MedGemma with a short task description and the toolbox, for example:
% \begin{verbatim}
% Task: Determine if the central 32x32 region 
% of this 96x96 histopathological image 
% contains any tumor tissue.

% TOOLBOX = {
%     "histo_nuc_centroid",
%     "histo_nuc_bbox",
%     "histo_nuc_contour",
%     "histo_nuc_type",
%     "histo_nuc_type_prob",
%     "histo_tissue_segmentor"
% }
% \end{verbatim}
% MedGemma is instructed to select a fixed number of tools from this set. Its selections define a binary mask over tools for each image, which we use to build the VLM–Exact, VLM–Stochastic, and VLM–Global masking regimes for Camelyon17.


% For ISIC 2017, the toolbox is constructed from the lesion segmentation tool, dermoscopic structure maps, and color-based markers:
% \texttt{derm\_lesion\_segmenter}, \texttt{derm\_pigment\_network}, \texttt{derm\_negative\_network}, \texttt{derm\_milia\_like\_cyst\_detector}, \texttt{derm\_streaks\_detector}, \texttt{derm\_marker\_malignant\_union}, and \texttt{derm\_marker\_browns}. 
% We use task-specific prompts depending on the classification problem. For the melanocytic vs.\ non-melanocytic task, we use:
% \begin{verbatim}
% Task: Determine if the lesion in the dermoscopic image 
% is melanocytic or non-melanocytic.
% \end{verbatim}, and for the benign vs.\ malignant task task we use:
% \begin{verbatim}
% Task: Determine if the lesion in the dermoscopic image 
% is malignant or benign.
% \end{verbatim}
% For our toolbox we have:
% \begin{verbatim}
% TOOLBOX = {
%     "derm_lesion_segmenter",
%     "derm_pigment_network",
%     "derm_negative_network",
%     "derm_milia_like_cyst_detector",
%     "derm_streaks_detector",
%     "derm_marker_malignant_union",
%     "derm_marker_browns"
% }
% \end{verbatim}
% As in the Camelyon17 setting, MedGemma’s per-image tool selections define the VLM-guided binary masks $M(x)$ used in the TBM–VLM-Exact, TBM–VLM-Stochastic, and TBM–VLM-Global variants for ISIC 2017.


% % \subsection{Model Variants}
% \label{sec:model-variants}
% In subsequent sections, we refer to TBM models with early fusion as $\TBMe$ and TBM models with late fusion as $\TBMl$.
% In subsequent sections, we refer to TBM models which uses all available tools as simply TBM. 
% We refer to TBM with VLM tool selection as TBM-VLM.
% The VLM outputs an image-level tool mask 
    % in which only tools selected by MedGemma for that image are kept:
    % \[
    % \bm{m}_i(x) = \mathbf{1}[\text{tool } i \text{ selected by VLM}].
    % \]
     % This variant reflects a scenario where the VLM perfectly identifies the relevant tool set for each case. 
    % 
    % \item \textbf{TBM-VLM Global:} Instead of image-level mask, a dataset-level mask is used in which each tool is retained according to its empirical VLM selection frequency across all training images:
    % \[
    % \bm{m}_i(x) \sim \mathrm{Bernoulli}(p_{\mathrm{keep},i}), 
    % \qquad
    % p_{\mathrm{keep},i} = \Pr(\text{VLM selects tool } i)
    % \]
    % Each tool is retained according to this dataset-level $p_{\text{keep},i}$. This tests whether global task priors suffice compared to instance-specific guidance.
% Note that TBM--VLM may be suboptimal if there is distribution shift.
    % \[
    % \bm{m}_i(x) \sim \mathrm{Bernoulli}(p_{\mathrm{sel}}) \text{ if tool } i \text{ is selected}, 
    % \]
    % \[\bm{m}_i(x) \sim \mathrm{Bernoulli}(p_{\mathrm{unsel}}) \text{ otherwise}
    % \]
    % This TBM variant leverages MedGemma's tool selection as a stochastic, per-image prior. It exposes the TBM to diverse tool combinations during training, including unseen ``OOD" subsets, promoting robustness to variable tool availability or image-dependent tool selections. 
% We refer to this variant as TBM-VLM Perturbed.

% \end{itemize}
% In addition to the proposed VLM-TBM variants, we also evaluate several variants of the TBM framework to disentangle the effects of tool availability and VLM-guided dropout:
% \begin{itemize}
    % \item \textbf{TBM All Tools (No Bottleneck):} All tools are always kept ($M(x) \equiv \mathbf{1}$). This model uses the full tool set at both training and inference and therefore represents an upper-bound ``Tool Model’’ without any selective bottleneck, measuring the benefit of full tool supervision and no dropout. This base model does not learn to handle missing or selective tools.

    % \item \textbf{TBM All Tool, VLM-Selected Tools for Validation:} The model is trained without dropout (all tools available), but at evaluation time only the tools selected by the VLM (MedGemma) are kept for each image. This allows us to isolate the effect of restricting inference to VLM-relevant tools without altering the training process.
    
    
    % \item \textbf{TBM Random:} A control variant where all tools share a uniform keep probability $p_{\mathrm{keep}} = 0.5$, independent of VLM information:
    % \[
    % \bm{m}_i(x) \sim \mathrm{Bernoulli}(0.5)
    % \]
    % This serves as an ablation to measure how much of the model's robustness and interpretability derives from informed VLM priors versus uniform random dropout.

% \end{itemize}