\section{Experiments}
\label{sec:experiments}
% \subsection{Objectives}
% \label{sec:objectives}
% We evaluate the Tool Bottleneck Model (TBM) and its variants to answer the following questions:
% \begin{enumerate}
%     \item Does decomposing images into multiple tool outputs improve performance and generalization compared to black-box models?
%     \item How does VLM-guided tool selection affect robustness under missing or mismatched tools?
%     \item Do TBMs provide interpretable and actionable/intervenable reasoning compared to post-hoc explanations?
% \end{enumerate}



\subsection{Dataset and Tools}
\label{sec:dataset_and_tools}
We report results on three medical image understanding tasks (Camelyon17, ISIC-BM, and ISIC-MN) derived from two medical imaging datasets in histopathology and dermatology.
Detailed descriptions of each dataset and tool are provided in the Appendix.

The Camelyon17-WILDS dataset~\cite{koh2021wilds} is adapted from the CAMELYON17 challenge~\cite{litjens2018camelyon}, which consists of whole-slide images (WSIs) of breast cancer metastases in lymph node sections. 
Each WSI is manually annotated by pathologists to mark tumor regions, from which non-overlapping $96 \times 96$ pixel patches are extracted and labeled as either tumor or normal. 
% A patch is labeled \textit{Tumor} if the central $32 \times 32$ region contains any tumor tissue, and \textit{Normal} if it contains no tumor and at least $20\%$ normal tissue in that central region.


The ISIC 2017 dataset~\cite{codella2018skin} provides 2{,}000 dermoscopic training images for skin lesion analysis, with three primary diagnostic categories: melanoma (malignant, melanocytic), nevus (benign, melanocytic), and seborrheic keratosis (SK) (benign, non-melanocytic). 
We evaluate two clinically relevant binary classification tasks: 1) ISIC-BM: Benign vs.\ Malignant, where melanoma is treated as malignant and \{nevus, SK\} as benign; and 2) ISIC-MN: Melanocytic vs.\ Non-melanocytic, where \{melanoma, nevus\} are grouped as melanocytic and SK as non-melanocytic. 
% Class counts in the training set are 374 melanoma, 254 SK, and 1{,}372 nevi. 
For both ISIC-BM and ISIC-MN, the validation and test sets contain 150 and 600 images, respectively. 
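Writing $d \in \{\mathrm{melanoma}, \mathrm{nevus}, \mathrm{SK}\}$ for the ground-truth diagnosis of an image (shorthand introduced here for illustration, not notation from the dataset), the two binary targets can be expressed as
\[
y_{\mathrm{BM}} = \mathbf{1}[d = \mathrm{melanoma}], \qquad
y_{\mathrm{MN}} = \mathbf{1}[d \in \{\mathrm{melanoma}, \mathrm{nevus}\}],
\]
where treating malignant and melanocytic as the positive classes is an assumed convention. Under this convention, a melanoma has $(y_{\mathrm{BM}}, y_{\mathrm{MN}}) = (1, 1)$, a nevus has $(0, 1)$, and an SK has $(0, 0)$.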

% The official training set contains 2{,}000 JPEG dermoscopic images with ground-truth diagnoses. % and a CSV of minimal clinical metadata (image\_id, age\_approximate, sex). 
% Class counts in the training set are 374 melanoma, 254 SK, and 1{,}372 nevi. 
% The validation and test sets contain 150 and 600 images, respectively. 
% Ground-truth labels are provided via two binary indicators: a melanoma indicator (1 for melanoma, 0 otherwise) and an SK indicator (1 for SK, 0 otherwise), which we map to the two binary tasks above. 

% \subsubsection{ISIC Tools}
% \label{sec:isic-tools}
% \subsubsection{Camelyon17 Tools}
% \label{sec:camelyon-tools}
% \joy{Maybe we can move these tools into 3.1.}

\subsection{Experimental Setup}
\label{sec:exp-setup}
We use MedGemma~\cite{sellergren2025medgemma} as the tool selector for \frameworkabbr{} and set $k=3$ for all experiments. We set $\alpha=0.9$ for Camelyon17 and ISIC-BM, and $\alpha=0.8$ for ISIC-MN (see Section~\ref{sec:tbm}). These values were selected by grid search over $k \in \{2, 3, 4\}$ and $\alpha \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$ on each task's validation set (Appendix~\ref{app:tbf_training_ablations}). 
% All reported fixed top-$k$ experiments therefore use these $\alpha$ values to perturb tool selection during training.  

For Camelyon17, we subsample 5{,}000 patches to reduce computational cost while retaining patient-level diversity, drawing 100 patches from each patient (WSI) folder following the official WILDS splits. 
All models are trained on the training split and evaluated on the held-out in-distribution (ID) and out-of-distribution (OOD) hospital splits. % described in Section~\ref{sec:exp-setup}.
%
% For TBM, we stack all tool feature maps along the channel dimension to form a single multi-channel tensor of size $C \times 96 \times 96$. 
For $f_\theta$, we use an ImageNet-pretrained EfficientNet-B0, modified to accept $C$ input channels. 
% We consider both models trained from scratch and models initialized with ImageNet-pretrained EfficientNet weights; in the latter case, we replace only the input stem while keeping the remaining blocks initialized from ImageNet and fine-tune end-to-end. 
% The global feature vector from EfficientNet is passed to a linear classifier for binary tumor prediction. 
%
% For the TBM with late fusion (\textbf{$\TBMl$}), each tool (e.g., box map, centroid map, type map, probability map, contour map, tissue mask, mitosis map) is processed by its own encoder based on an EfficientNet-B0 backbone. Similar to the shared-encoder TBM, for each tool we instantiate an EfficientNet-B0 model whose input stem is replaced to accept the tool’s channel dimension while retaining the pretrained ImageNet weights for the remaining layers. The spatial feature maps produced by each backbone are globally pooled and passed through a linear projection layer to obtain a low-dimensional (scalar) embedding per tool. The resulting tool-wise embeddings are concatenated and fed into a linear classifier for binary tumor prediction. This design preserves per-tool modularity-- each tool has its own encoder that can be inspected or ablated, while still benefiting from strong pretrained features. 
% Because the final projection for each tool is one-dimensional, the classifier operates on a compact vector of tool scores, making it straightforward to interpret or intervene on individual tool contributions.
%
% In TBM--VLM models, we apply tool dropout via the binary mask $M(x)$ as described in Section~\ref{sec:model-variants}. 
% During training, the mask is sampled according to the chosen regime (TBM--VLM or TBM--VLM Perturbed), and the corresponding tool channels are either retained or replaced by a fixed null value (set to $-1$) before being passed to the encoder(s). 
% This exposes the model to varying subsets of tools during training, so that at test time it can robustly handle the restricted VLM-selected tool sets.
%
All Camelyon TBMs are optimized with cross-entropy loss using Adam with a learning rate of $10^{-4}$ and weight decay of $10^{-4}$, for 40 epochs. % on the 5{,}000-patch subset. 
We select the best checkpoint based on validation performance on the corresponding ID Val split.
Because the Camelyon17 dataset is balanced across classes, and following prior work~\cite{koh2021wilds}, we report accuracy in all results.

For ISIC, we train TBMs on both binary tasks (ISIC-BM and ISIC-MN). % using the image, lesion mask, dermoscopic structure maps, and color-based marker maps as tools (Section~\ref{sec:tools-vlm}). 
Images are resized to a fixed input resolution of $224\times224$ and augmented with random flips.
% We restrict our experiments to the $\TBMe$ architecture, as TBM with early fusion performed better than with late fusion Camelyon17 (shown in Section~\ref{sec:camelyon-analysis}.
%
% We experiment with two choices of tool encoder for TBM with early fusion: (i) an EfficientNet-B0 backbone with a modified input stem, initialized from ImageNet and fine-tuned end-to-end (same as TBM Camelyon), and (ii) a simple CNN with a few convolutional blocks trained from scratch on the ISIC training set. In both cases, the encoder takes the stacked multi-channel tool tensor as input and outputs a global feature vector that is passed to a linear classifier. Empirically, the smaller CNN encoder yields slightly better performance than the EfficientNet-based encoder, likely due to reduced overfitting in this limited-data regime.
For $f_\theta$, we use a CNN with 4 convolutional blocks (32, 32, 64, 128 channels) trained from scratch on the ISIC training set. 
% Empirically, we find that the smaller CNN encoder yields slightly better performance than the EfficientNet-based encoder, likely due to reduced overfitting on a relatively smaller dataset. % in this limited-data regime.
% The encoder takes the stacked multi-channel tool tensor as input and outputs a global feature vector that is passed to a linear classifier. 
%
Both ISIC tasks exhibit class imbalance, so we train with a class-weighted binary cross-entropy loss. 
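One common form of a class-weighted binary cross-entropy loss, given here as a sketch rather than as the exact objective, is
\[
\mathcal{L}_{\mathrm{wBCE}} = -\frac{1}{N} \sum_{n=1}^{N} \left[ w_1 \, y_n \log \hat{p}_n + w_0 \, (1 - y_n) \log (1 - \hat{p}_n) \right],
\]
where $y_n \in \{0,1\}$ is the label of the $n$-th training image, $\hat{p}_n$ is the predicted probability of the positive class, and $w_0, w_1$ are per-class weights. The symbols $N$, $\hat{p}_n$, $w_0$, and $w_1$ are notation introduced for this sketch; a typical choice, assumed here only for illustration, sets each weight inversely proportional to the frequency of its class in the training set.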
All ISIC TBMs are trained using the Adam optimizer (learning rate $1\times10^{-3}$, weight decay $1\times10^{-4}$) with a cosine annealing schedule over 20 epochs. %, which we found sufficient for convergence given the smaller dataset size. 
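For reference, the standard cosine annealing rule sets the learning rate at epoch $t$ to
\[
\eta_t = \eta_{\min} + \frac{1}{2} \left( \eta_{\max} - \eta_{\min} \right) \left( 1 + \cos \frac{\pi t}{E} \right),
\]
with $\eta_{\max} = 10^{-3}$ and $E = 20$ epochs in this setting. Taking $\eta_{\min} = 0$ and per-epoch updates as illustrative assumptions, the rate falls to $5 \times 10^{-4}$ at the midpoint $t = 10$ and to $0$ at $t = E$.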
For both ISIC-BM and ISIC-MN, we report the area under the receiver operating characteristic curve (AUC) in all results.
For all tasks, tool outputs are rasterized into multi-channel maps with values in $\{0,1\}$ and concatenated along the channel dimension. 
% Similarly, for ISIC, the lesion, dermoscopic feature, and color-based masks are rasterized into binary $\{0,1\}$ maps and stacked into a multi-channel tensor. 
Further details are in Appendix~\ref{app:tool_details}. 
For all experiments, we set $\bm{\bar{z}}_i = \mathbf{-1}^{C_i \times H \times W}$, i.e., a constant map of $-1$s.
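As a sketch of how this null map enters the model, assume that $\bm{z}_i \in \{0,1\}^{C_i \times H \times W}$ denotes the rasterized output of tool $i$ and that $m_i \in \{0,1\}$ indicates whether tool $i$ is selected (both $\bm{z}_i$ and $m_i$ are shorthand assumed here; Section~\ref{sec:tbm} gives the formal definitions). The input to $f_\theta$ can then be written as
\[
\bm{\tilde{z}}_i = m_i \, \bm{z}_i + (1 - m_i) \, \bm{\bar{z}}_i,
\qquad
\bm{\tilde{z}} = \left[ \bm{\tilde{z}}_1; \dots; \bm{\tilde{z}}_{|\mathcal{T}|} \right] \in \{-1, 0, 1\}^{C \times H \times W},
\qquad
C = \sum_{i=1}^{|\mathcal{T}|} C_i .
\]
Because valid tool outputs take values in $\{0,1\}$, the value $-1$ never coincides with a genuine output, so an unselected tool remains distinguishable from a selected tool whose map is empty.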

\subsection{Baselines} 
\label{sec:baselines}
% \joy{Let's order these methods as we do in Table 1. Possibly grouped into training on specialized data vs not.}
We compare against two classes of baselines. 
The first class consists of state-of-the-art zero-shot VLMs with and without tool use.
We implement all tool-use frameworks so that they use our toolbox $\mathcal{T}$. 
\textit{MedGemma}~\cite{sellergren2025medgemma} is a closed-source medical VLM; it does not expose a way to integrate custom tools.
We therefore include tool outputs as additional prompts to the model and denote this baseline as \textit{MedGemma w/ Tool Prompts}. We additionally evaluate Gemma~3~\cite{team2025gemma} zero-shot and with tool outputs as additional prompts (denoted \textit{Gemma w/ Tool Prompts}). These additional prompts serve as pixel-level clinical priors, controlling for differences in pixel-level supervision and inductive biases that may not be accounted for in other baselines.
% Two losses are minimized: pixel-wise loss at the final layer and an image-level loss at the middle layer. 
{\textit{VisProg}~\cite{gupta2023visual} is a neuro-symbolic method for image understanding that uses a pretrained VLM to write executable code consisting of tool calls for reasoning over visual inputs. \textit{LLaVA-Med}~\cite{li2023llavamed} is an open-source medical VLM, which we first evaluate in a zero-shot setting.

The second class consists of models trained or fine-tuned on our data.
\textit{EfficientNet}~\cite{tan2019efficientnet} serves as a standard black-box CNN baseline trained directly on resized raw images without any tool inputs.
\textit{Y-Net}~\cite{mehta2018net} is an extension of U-Net~\cite{ronneberger2015unetconvolutionalnetworksbiomedical} originally proposed for joint segmentation and classification, where a pixel-level loss is applied at the final layer of the U-Net and an image-level loss is applied at the middle bottleneck layer of the U-Net. 
We train Y-Net on the same amount and types of data as our proposed model; specifically, a pixel-level loss on the tool outputs is minimized at the final layer, while an image-level classification loss is minimized at the middle layer. Finally, we include a fine-tuned \textit{LLaVA-Med (FT)} baseline, where the pretrained LLaVA-Med backbone is fine-tuned on our data. 
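For the Y-Net baseline, the joint objective can be sketched as
\[
\mathcal{L}_{\mathrm{YNet}} = \mathcal{L}_{\mathrm{pix}}(\bm{\hat{z}}, \bm{z}) + \lambda \, \mathcal{L}_{\mathrm{cls}}(\hat{y}, y),
\]
where $\mathcal{L}_{\mathrm{pix}}$ compares the final-layer prediction $\bm{\hat{z}}$ with the pixel-level tool outputs $\bm{z}$, and $\mathcal{L}_{\mathrm{cls}}$ compares the middle-layer prediction $\hat{y}$ with the image label $y$. The symbols and the weighting factor $\lambda$ are hypothetical notation for this sketch; the simplest assumed choice is $\lambda = 1$, which sums the two losses without reweighting.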
% We finetune the language model via LoRA adapters, partially unfreeze the last few vision blocks, and keeping the overall model architecture unchanged. 
% For all models, we match the experimental and training setups with our models as much as possible.
More details are provided in Appendix~\ref{app:baseline_details}.


% \1 \textbf{BACH (BreAst Cancer Histology) Dataset.}
% \2 The ICIAR 2018 BACH dataset (cite) contains 400 microscopy images of breast tissue, each labeled according to the predominant cancer type: 
% \textit{Normal} (100), \textit{Benign} (100), \textit{In situ carcinoma} (100), and \textit{Invasive carcinoma} (100). 
% All images were annotated independently by two expert pathologists, and any samples with label disagreement were discarded.
% \2 Images are stored in TIFF format with the following specifications: RGB color model, $2048 \times 1536$ pixels, pixel scale of $0.42\,\mu m \times 0.42\,\mu m$, and file sizes ranging from 10–20 MB. 
% Each image is labeled at the image level (no pixel-level annotations).
% \2 For this work, we define two classification tasks:
% \3 \textbf{(1) 4-class classification:} predict one of four tissue categories — Normal, Benign, In situ carcinoma, or Invasive carcinoma.
% \3 \textbf{(2) Binary classification:} combine Normal and Benign into a \textit{non-cancerous} class, and In situ and Invasive carcinoma into a \textit{cancerous} class. 
% This task formulation allows direct comparison with binary diagnostic setups such as Camelyon17, while retaining a finer-grained 4-class variant for analysis of multi-category pathology prediction.
% \2 All images are preprocessed and standardized using the Macenko stain normalization method. (cite Macenko)
% % \3 All images are resized to $512 \times 512$ or cropped into non-overlapping $512 \times 512$ patches for training, and standardized using stain normalization and histogram equalization.

% \begin{outline}
   
% \1 Training Details

% \2 BACH
% \3 We use the BACH microscopy image dataset, which consists of $2048 \times 1536$ RGB breast tissue images labeled as Normal, Benign, In situ carcinoma, or Invasive. 
% To maintain class balance and ensure representative sampling, we randomly select 20 images from each of the four categories (each containing 100 images) for the held-out test set and use the remaining 80 images per class for training and validation (split 65/15).
% This results in 320 training and 80 testing images in total, with validation images drawn from the training pool.
% \3 All microscopy images are kept at their original $2048 \times 1536$ resolution to preserve cellular and tissue-level detail. Matching the Camelyon17 setup, each image is processed through the TIAToolbox HoverNet Nucleus Instance Segmentation Tool to extract the same five expert-level pixel-wise features. (add tissue feature?? ) 
% The resulting tool outputs are rasterized into $2048 \times 1536$ pixel feature maps aligned with the original image dimensions.
% \3 For the black-box baseline, we first apply Macenko stain normalization on the raw microscopy imsages, then again train a pretrained EfficientNet-B0 classifier (ImageNet weights) directly on the normalized $2048\times1536$ RGB inputs using the same optimization settings as TBM. 

% \2 TBM training (same for both Camelyon and BACH)
% \3 During training, tool dropout is applied dynamically according to the keep mask $M \in \{0,1\}^{B\times T}$ defined by the selected TBM variant (Section reference Methods model variants section). 
% For each image $x_b$ and tool $t_i$, $M_{b,i}$ determines whether the corresponding tool feature map is retained or masked. 
% Instead of removing dropped tools entirely, we simulate their absence by setting the corresponding input channels to a constant value of $-1$, preserving tensor dimensionality and network structure. We choose to mask with $-1$ instead of $0$ because .... (cite). Masking out tools to simulate dropout rather than taking it out of the framework allows the TBM to only be trained once and be flexible across different combinations of tools during inference.
% The mask $M$ is constructed deterministically or stochastically depending on the regime: 
% \4 All Tools: all tools are kept for every image, $M = \mathbf{1}_{B\times T}$.
% \4 VLM Exact: for each image $x_b$, only tools selected by the VLM (MedGemma) are retained, 
% $M_{b,i} = \mathbf{1}[t_i \in S_b]$, where $S_b$ is the VLM-selected subset.
% \4 VLM Stochastic: for each image, tools selected by the VLM are retained with high probability ($p_{\mathrm{sel}} = 0.95$) and unselected tools with low probability ($p_{\mathrm{unsel}} = 0.10$), 
% $M_{b,i} \sim \mathrm{Bernoulli}(p_i)$.
% \4 VLM Global: each tool $t_i$ is retained independently with probability $\pi_i$, 
% the global selection frequency of that tool across all training images, 
% $M_{b,i} \sim \mathrm{Bernoulli}(\pi_i)$.
% \4 VLM Random: all tools are retained independently with a uniform probability of $p_{\mathrm{keep}} = 0.5$, 
% $M_{b,i} \sim \mathrm{Bernoulli}(0.5)$.
% \3 The resulting masked feature tensor is then passed through the TBM encoder (shared or separate), ensuring consistent input shape while exposing the model to varying subsets of tools throughout training.


% \3 \textbf{Shared Encoder TBM}
% \4 The shared-encoder TBM  uses an EfficientNet-B0 backbone as a joint feature encoder. The model input consists of all concatenated tool feature maps, with the EfficientNet stem modified to accept the total number of tool channels:
% The first convolutional layer is replaced with a new stem that adjusts for the summed input channels across all tool feature maps.
% The backbone features are pooled and passed through a fully connected classifier producing binary outputs (Tumor vs. Normal).
% To simulate missing or uncertain tools, a learnable dropout mask is applied over tool channels by dropping out unused tools with -1, which multiplies or replaces tool-specific channel slices with a learned null embedding when dropped.

% \3 \textbf{Separate Encoder TBM}
% \4 The separate-encoder TBM processes each tool through an independent EfficientNet-B0 encoder.
% Each encoder consists of:
% A convolutional stem followed by three depthwise-separable convolutional blocks.
% Global average pooling and a projection head mapping to a X dimensional embedding.
% Each tool embedding is concatenated across all tools and passed through a linear classifier head.
% Tool dropout is applied by masking tool-specific inputs before encoding, simulating partial tool availability.

% \2 We sample a random sample of 5000 images, 50 from each of the patient folders in the dataset. 
% \2 We obtain the bounding box, centroid, nucleus type, contour, and type probability features from each image by calling the TiaToolBox Hovernet Nucleus Instance Segmentation Tool. For all images, tool outputs were rasterized into $96\times96$ pixel feature maps. 
% \2 For the Blackbox baseline, an Efficient-net 


% \subsection{Baseline Methods}
% \label{sec:base-methods}
% \begin{outline}
% \1 Black-box
% \2 EfficientNet trained directly on raw images.
% \1 Y-Net (segmentation-for-classification baseline)
% \2 Y-Net extends U-Net by adding a parallel branch for discriminative (saliency) map generation alongside tissue segmentation. Encoder/decoder blocks are abstracted with width/depth multipliers to tune capacity without changing the network topology. Discriminative instance selection is performed on patch (instance) outputs. The per-instsance probability map is thresholded and combined with the segmentation to form a discriminative segmentation mask. Frequency and co-occurence features from this mask are fed into an MLP for slide-level diagnosis. Y-net supports  convolutional block
% modularity and matches state-of-the-art segmentation with far fewer parameters. 
% \2 Similar to TBM, Y-Net operates on specific supervised tasks, can leverage external knowledge through segmentation supervision, and is trainable end-to-end. However, it is less interpretable than TBM, as its intermediate representations are not decomposed into explicit, human-verifiable tool outputs.

% \1 General VLMs (GPT-o3)
% \1 Medical VLM/LLM (MedGemma/AIME)
% \1 MedGemma zero-shot predictions
% \2 MedGemma is a large multimodal vision–language model pretrained on medical images and clinical text across radiology, pathology, dermatology, and ophthalmology domains. In our experiments, we evaluate MedGemma in a zero-shot setting using textual prompts of the dataset-specific task without any task-specific fine-tuning. This baseline represents the performance of a powerful, general-purpose medical foundation model that leverages extensive external medical knowledge. Similar to TBM, it can integrate domain knowledge and operate across modalities. However, unlike TBM, it is not explicitly interpretable or trainable, as its reasoning process is opaque and not intervenable, and its outputs cannot be decomposed into verifiable, tool-level predictions.
% \1 LLM + API (ViperGPT)
% \1 Trained VQA models (FiLM)
 
