

\section{Dataset Details}
\begin{itemize}
    \item 
% \textbf{Camelyon17-WILDS.}
\label{sec:camelyon17-wilds}
The \textbf{Camelyon17-WILDS} dataset \cite{koh2021wilds} is adapted from the CAMELYON17 challenge~\cite{litjens2018camelyon}, which consists of whole-slide images (WSIs) of breast cancer metastases in lymph node sections. Each WSI is manually annotated by pathologists to mark tumor regions, from which non-overlapping $96 \times 96$ pixel patches are extracted and labeled as either \textit{Tumor} or \textit{Normal}. A patch is labeled \textit{Tumor} if the central $32 \times 32$ region contains any tumor tissue, and \textit{Normal} if it contains no tumor and at least $20\%$ normal tissue in that central region.

The dataset comprises approximately 450{,}000 patches extracted from 50 WSIs, collected from five hospitals in the Netherlands (10 WSIs per hospital). The pathologists' segmentation masks were used to assign patch-level labels. Metadata includes the slide ID (WSI) and the hospital identifier (domain) for each patch.

To evaluate cross-domain generalization, data is split by hospital as follows: the \textit{training} split contains 302{,}436 patches from 30 WSIs (10 WSIs each from 3 hospitals); the \textit{validation (OOD)} split contains 34{,}904 patches from 10 WSIs from the 4th hospital; the \textit{test (OOD)} split contains 85{,}054 patches from 10 WSIs from the 5th hospital, chosen for its distinct staining characteristics and visual style; and the \textit{validation (ID)} split contains 33{,}560 patches from the same 30 WSIs used for training. This setup ensures that no WSIs or hospitals overlap between the training and OOD splits, making the benchmark well-suited for studying domain generalization and robustness to inter-hospital variability. The task we consider is binary classification: given a $96 \times 96$ histopathology patch, predict whether the central $32 \times 32$ region contains any tumor tissue.

% \subsubsection{ISIC 2017}
\label{sec:isic-2017}

\item
The \textbf{ISIC 2017} dataset provides dermoscopic images for skin lesion analysis, with three primary diagnostic categories: melanoma (malignant, melanocytic), nevus (benign, melanocytic), and seborrheic keratosis (SK) (benign, non-melanocytic). We evaluate two clinically relevant binary classification tasks: (1) \textit{Benign vs.\ Malignant}, where melanoma is treated as malignant and \{nevus, SK\} as benign; and (2) \textit{Melanocytic vs.\ Non-melanocytic}, where \{melanoma, nevus\} are grouped as melanocytic and SK as non-melanocytic.

The official training set contains 2{,}000 JPEG dermoscopic images with ground-truth diagnoses and a CSV of minimal clinical metadata (image\_id, age\_approximate, sex). Class counts in the training set are 374 melanoma, 254 SK, and 1{,}372 nevi. The validation and test sets contain 150 and 600 images, respectively. Ground-truth labels are provided via two binary indicators: a melanoma indicator (1 for melanoma, 0 otherwise) and an SK indicator (1 for SK, 0 otherwise), which we map to the two binary tasks above. Images are high-resolution dermoscopic photographs ($767 \times 1022$ pixels), which we preprocess to a fixed input resolution before tool extraction and model training.

For the malignant vs.\ benign task, the malignant class is underrepresented (Train: 374/2000; Val: 30/150; Test: 117/600), while for the melanocytic vs.\ non-melanocytic task the non-melanocytic (SK) class is the minority (Train: 254/2000; Val: 42/150; Test: 90/600). 
We compute a positive-class weight from the training distribution and apply it in the BCE loss for each task. 

\end{itemize}
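
The positive-class weighting used for the two ISIC 2017 tasks can be illustrated with a minimal sketch. It assumes the common convention of weighting the positive class by the ratio of negative to positive training examples; the exact formula and the choice of SK as the positive class for the second task are assumptions for illustration, and the helper name \texttt{positive\_class\_weight} is hypothetical.

```python
# Class counts reported for the ISIC 2017 training split.
TRAIN_COUNTS = {"melanoma": 374, "seborrheic_keratosis": 254, "nevus": 1372}


def positive_class_weight(num_positive: int, num_total: int) -> float:
    """Return n_negative / n_positive (an assumed weighting convention)."""
    if not 0 < num_positive < num_total:
        raise ValueError("need at least one positive and one negative example")
    return (num_total - num_positive) / num_positive


total = sum(TRAIN_COUNTS.values())  # 2000 training images

# Task 1 (benign vs. malignant): melanoma is the positive class.
w_malignant = positive_class_weight(TRAIN_COUNTS["melanoma"], total)

# Task 2 (melanocytic vs. non-melanocytic): SK is assumed to be the positive class.
w_sk = positive_class_weight(TRAIN_COUNTS["seborrheic_keratosis"], total)

print(f"malignant pos_weight = {w_malignant:.2f}")
print(f"SK pos_weight        = {w_sk:.2f}")
```

Under this convention, the rarer the positive class, the larger its weight in the BCE loss, so errors on the minority class are penalized more heavily.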

\section{Baseline Details}
\label{app:baseline_details}
\begin{itemize}
\item
\textbf{MedGemma Zero-Shot.} We evaluate MedGemma in a zero-shot setting using text prompts. We prompt MedGemma to return the predicted binary label, as well as a probability score in $[0,1]$ for each prediction, in order to report Accuracy and AUC.
This baseline represents the performance of a general-purpose medical VLM. 
Similar to \frameworkabbr{}, it can integrate domain knowledge and operate across modalities. 
However, unlike \frameworkabbr{}, it is not explicitly interpretable or trainable, as its reasoning process is opaque and not intervenable, and its outputs cannot be decomposed into verifiable, tool-level predictions.

\item 
{\textbf{MedGemma w/ Tool Prompts}.}
We evaluate a tool-use variant of MedGemma that operates in a two-turn, zero-shot fashion using our toolbox $\mathcal{T}$. In the first turn, we prompt MedGemma to act as a tool selector, as detailed in Section~\ref{sec:vlm}, with the full prompts listed in Appendix~\ref{app:vlm_prompting}. In the second turn, the outputs of the selected tools are rasterized into spatial map images and passed to MedGemma as additional visual inputs, along with the original image and task, to obtain the final predicted label and probability score. We do not fine-tune the model on any task labels. This baseline represents a medical VLM that can both select and utilize the same tools used by \frameworkabbr{}. 
\item
\textbf{Gemma Zero-Shot.} We evaluate Gemma 3 in the same way as MedGemma Zero-Shot. This baseline represents the performance of a general VLM.  
\item 
{\textbf{Gemma w/ Tool Prompts}.}
We evaluate a tool-use variant of Gemma 3 in the same way as MedGemma w/ Tool Prompts. This baseline represents a general VLM that can both select and utilize the same tools used by \frameworkabbr{}. 
\item 
{\textbf{VisProg}.} We evaluate VisProg~\cite{gupta2023visual}, a tool-use system that answers visual questions by composing executable programs, which access tools from our toolbox $\mathcal{T}$. 
For all tasks, we enable VisProg to utilize the tools specific for the given modality.
% For ISIC 2017, we extend VisProg with our dermatology tools (lesion segmentation, dermoscopic structure maps, and color-based marker tools) exposed as callable primitives, and prompt the underlying VLM to write programs that call these tools.
% and then answer a task-specific yes/no question (e.g., ``Using all available tools, is this dermoscopic image malignant?''). 
% For Camelyon17, we similarly register histopathology tools (nucleus centroids, bounding boxes, contours, type maps, and type probabilities) and prompt VisProg to compose tool calls before answering whether a patch contains tumor tissue. 
We prompt it to return both a binary classification and a predicted probability in order to compute Accuracy and AUC.
However, we found that VisProg returns only binary outputs, even when explicitly prompted for a probability.
Thus, to compute AUC, we use the binary prediction, mapped to $\{0, 1\}$, as the probability score. 
For both datasets, we do not fine-tune the code generation VLM or the VQA (BLIP) VLM on our data. 
This baseline therefore reflects the performance of a code-generation tool-use model that performs text-based compositional reasoning while explicitly leveraging the same toolbox used by our TBF.

\item 
\textbf{EfficientNet.} For comparison with a standard CNN, we use the EfficientNet-B0 architecture~\cite{tan2019efficientnet}, which is a black-box baseline trained directly on raw images, without any tool inputs. 
Images are resized to match the \frameworkabbr{} input resolution ($96 \times 96$ for Camelyon17 and $224 \times 224$ for ISIC), and we use the same optimizer configuration as for \frameworkabbr{}. %Adam with learning rate $10^{-3}$, weight decay $10^{-4}$, and training for 40 epochs on the corresponding training split. 
% We report results for models trained from scratch as well as models initialized with ImageNet-pretrained weights and fine-tuned end-to-end.


\item
\textbf{Y-Net.} 
\frameworkabbr{} leverages additional data in the form of tools and their pixel-level outputs.
To disentangle the effect of our model formulation from that of the increased amount and type of data, we compare against a popular Y-Net-style segmentation-for-classification model~\cite{mehta2018net}. 
Y-Net leverages pixel-wise supervision in addition to image-level labels, but does not use explicit tool decomposition.
Thus, Y-Net and \frameworkabbr{} are trained on the same amount and type of data.
Training is performed in two stages. 
In Stage 1, we train Y-Net with only the segmentation head enabled to predict tumor-versus-background masks from raw patches ($96\times96$ for Camelyon17 and $224\times224$ for ISIC). The supervision targets are tumor masks derived from HoVer-Net outputs for Camelyon17 and lesion segmentation masks for ISIC. 
The model is optimized with pixel-level cross-entropy loss using Adam (learning rate $10^{-3}$, batch size 16) for 25 epochs, and monitored with Dice score. 
In Stage 2, we jointly train the segmentation and classification heads with the combined loss
$
\mathcal{L} = \mathcal{L}_{\text{seg}} + \mathcal{L}_{\text{cls}},
$
where both $\mathcal{L}_{\text{seg}}$ and $\mathcal{L}_{\text{cls}}$ are cross-entropy losses for the segmentation mask and binary label, respectively. 
Stage 2 uses Adam with a lower learning rate ($3\times10^{-4}$), batch size 16, and 20 epochs. %, and is evaluated with classification accuracy for Camelyon and Test AUC for ISIC, and segmentation Dice.
For both datasets,  Y-Net is trained on the same data splits and training setup as \frameworkabbr{}. %, ensuring a fair comparison of Y-net's segmentation-for-classification versus TBM's tool-based decomposition.

\item 
{\textbf{LLaVA-Med Zero-Shot and Finetuned.}}
We use the released LLaVA-Med~\cite{li2023llavamed} \texttt{llava-med-v1.5-mistral-7b} checkpoint for all LLaVA-Med experiments. 
For the zero-shot setting, we keep all model weights frozen and only prompt the model with task-specific instructions and answer formats. 
The exact prompts for all tasks are provided in Appendix~\ref{app:llavamed}. We ask the model to output a scalar confidence score in $[0,1]$ for the positive class and the binary class label. We use the continuous scores to compute AUC. 
For the finetuned setting (LLaVA-Med FT), we further finetune on the training set, where the corresponding text for each image is \{``tumor'', ``no tumor''\} for Camelyon17, \{``malignant'', ``benign''\} for ISIC-BM, and \{``melanocytic'', ``non-melanocytic''\} for ISIC-MN. 
% and adapt it to our binary classification tasks by minimizing the cross-entropy of the textual class labels on the training set. 
We unfreeze the last two vision transformer blocks and apply a LoRA adapter to the language model's attention and MLP projection layers. 
We train for $4$ epochs with AdamW, with learning rate $5\times 10^{-6}$, weight decay $0.01$, and a cosine learning-rate schedule with a warmup ratio of $0.1$. 
\end{itemize}
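
One consequence of scoring VisProg with binary predictions mapped to $\{0, 1\}$ is that the ROC curve has a single operating point, so the AUC reduces to the mean of the true-positive and true-negative rates (balanced accuracy). The sketch below checks this identity on made-up labels; the function names are hypothetical and the numbers are not results from our experiments.

```python
def roc_auc(labels, scores):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie), over all pos/neg pairs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, preds):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy labels and hard predictions (made-up numbers for illustration only).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]

auc = roc_auc(y_true, [float(p) for p in y_pred])
bal_acc = balanced_accuracy(y_true, y_pred)
print(auc, bal_acc)
```

This is why AUC values for baselines that emit only hard labels are not directly comparable in resolution to those computed from continuous scores.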
\section{Tool Details}
\label{app:tool_details}
\subsection{Camelyon17}
For Camelyon17, we construct histopathology tools using TIAToolbox~\cite{pocock2022tiatoolbox}, an open-source computational pathology toolbox that provides end-to-end pathology image analysis. Specifically, we use the HoVer-Net~\cite{graham2019hover} model for nucleus-level segmentation and classification. We apply this tool to each Camelyon17 patch, producing instance-level nuclei predictions, including per-nucleus class type labels, class type label probabilities, bounding boxes, segmentation contours, and centroid coordinates.
For each patch, HoVer-Net returns a dictionary indexed by a unique nucleus identifier \texttt{nuc\_id}, where every entry contains:
\texttt{box}, \texttt{centroid}, \texttt{contour}, \texttt{prob}, and \texttt{type}, corresponding to the nuclei instance-level predictions of bounding boxes, centroid coordinates, segmentation contours, class type label probabilities, and class type labels. 
We convert these instance-level outputs into spatial feature maps rasterized onto a $96 \times 96$ grid aligned with the corresponding tissue patch:
\begin{itemize}
    \item \textbf{Bounding boxes (\texttt{histo\_nuc\_bbox}).} The \texttt{box} field stores bounding box coordinates in the format
    $[x_{\text{top-left}}, y_{\text{top-left}}, \text{width}, \text{height}]$.
    For each nucleus, we fill its bounding box on a binary canvas and then downsample to $96 \times 96$.
    Overlapping boxes are merged by taking the per-pixel maximum, yielding a single-channel map that highlights regions with dense or enlarged nuclei.
    
    \item \textbf{Centroids (\texttt{histo\_nuc\_centroid}).}
    The \texttt{centroid} field stores the nucleus center in $[x_{\text{centre}}, y_{\text{centre}}]$ coordinates. We place a small blob at each centroid location on a blank canvas and resample to $96 \times 96$, producing a single-channel map that encodes the spatial distribution of nuclei (nuclear density and clustering).

    \item \textbf{Contours (\texttt{histo\_nuc\_contour}).}
    The \texttt{contour} field is a list of points forming the polygonal boundary of each nucleus.
    We rasterize these polygons by drawing only the boundary pixels on a binary canvas and downsampling to $96 \times 96$. This single-channel map emphasizes nuclear shape and boundary irregularity while suppressing interior regions.
    
    \item \textbf{Nucleus types (\texttt{histo\_nuc\_type}).} The \texttt{type} field stores the predicted discrete class label for each nucleus as an integer in $\{0,\dots,5\}$, corresponding to categories such as background, neoplastic epithelial, inflammatory, connective, dead, and non-neoplastic epithelial cells. We create a multi-channel one-hot tensor by assigning each pixel within a nucleus to its predicted class and stacking the resulting binary masks across channels.
    This yields a $C_{\text{type}} \times 96 \times 96$ tensor (with $C_{\text{type}}=6$ in our implementation) that captures spatial distributions of different cell types.

    \item \textbf{Type probabilities (\texttt{histo\_nuc\_type\_prob}).} The \texttt{prob} field stores the confidence (probability) of the predicted type for each nucleus. We propagate this scalar confidence value to all pixels in the corresponding nucleus mask and aggregate overlapping instances by taking the per-pixel maximum. The result is a single-channel confidence map in $[0,1]$ that highlights regions where the model is confident about nuclear identity versus uncertain or ambiguous regions.
\end{itemize}
All five histopathology tools are computed once using the pretrained TIAToolbox HoVer-Net model and kept fixed. We do not fine-tune the underlying nucleus segmentation tool on Camelyon17. The resulting maps are concatenated along the channel dimension into a tool tensor of shape $(C_{\text{tool}}, 96, 96)$, with $C_{\text{tool}} = C_{\text{type}} + 4$. As described in Section~\ref{sec:dataset_and_tools}, all Camelyon17 tool maps are scaled to lie in $[0,1]$ before masking; dropped tools are represented by channels filled with a constant ``missing'' value of $-1$ and are handled consistently across all masking regimes. Details on filling dropped tools with the ``missing'' value are provided in Section~\ref{app:tool_knockout}.
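
The rasterization and stacking steps above can be sketched as follows. This is a simplified illustration, not our exact implementation: it draws directly on the $96 \times 96$ grid instead of downsampling, omits the contour map (so it yields 9 rather than 10 channels), uses the bounding-box region as a stand-in for the true nucleus mask, and assumes a $3 \times 3$ centroid blob. The function names \texttt{rasterize} and \texttt{stack\_tools} are hypothetical, and the box format follows the description in the text.

```python
import numpy as np

SIZE = 96          # patch side length in pixels
NUM_TYPES = 6      # nucleus classes 0..5, as in the text


def rasterize(nuclei: dict, size: int = SIZE) -> dict:
    """Turn a {nuc_id: {...}} dictionary into per-tool maps of shape (C, size, size)."""
    bbox = np.zeros((1, size, size), dtype=np.float32)
    centroid = np.zeros((1, size, size), dtype=np.float32)
    type_map = np.zeros((NUM_TYPES, size, size), dtype=np.float32)
    type_prob = np.zeros((1, size, size), dtype=np.float32)

    for nuc in nuclei.values():
        x, y, w, h = (int(round(v)) for v in nuc["box"])
        # Clip the box to the canvas; rows index y, columns index x.
        rows = slice(max(y, 0), min(y + h, size))
        cols = slice(max(x, 0), min(x + w, size))

        bbox[0, rows, cols] = 1.0  # for binary maps, overwriting equals the per-pixel max
        # Stand-in for the true nucleus mask: the box region (an approximation).
        type_map[nuc["type"], rows, cols] = 1.0
        type_prob[0, rows, cols] = np.maximum(type_prob[0, rows, cols], nuc["prob"])

        cx, cy = (int(round(v)) for v in nuc["centroid"])
        if 0 <= cx < size and 0 <= cy < size:
            # 3x3 blob around the centroid (blob size is an assumption).
            centroid[0, max(cy - 1, 0):cy + 2, max(cx - 1, 0):cx + 2] = 1.0

    return {"bbox": bbox, "centroid": centroid, "type": type_map, "type_prob": type_prob}


def stack_tools(maps: dict, dropped: set = frozenset()) -> np.ndarray:
    """Concatenate tool maps along channels; dropped tools become constant -1 maps."""
    channels = [
        np.full_like(m, -1.0) if name in dropped else m
        for name, m in maps.items()
    ]
    return np.concatenate(channels, axis=0)


# Two made-up nuclei for illustration.
example = {
    "0": {"box": [10, 20, 8, 6], "centroid": [14, 23], "prob": 0.9, "type": 1},
    "1": {"box": [14, 22, 10, 10], "centroid": [19, 27], "prob": 0.6, "type": 2},
}
maps = rasterize(example)
tools = stack_tools(maps, dropped={"centroid"})
print(tools.shape)
```

Because valid tool values lie in $[0,1]$, the constant $-1$ fill is unambiguous: a downstream model can distinguish a dropped tool from a tool that simply detected nothing.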

\subsection{ISIC}
For ISIC 2017, images are high-resolution dermoscopic photographs ($767 \times 1022$ pixels), which we preprocess to a fixed input resolution before tool extraction. We construct a dermatology toolbox with seven tools: lesion segmentation, pigment network, negative network, streaks, milia-like cysts, malignant-color pigment marker, and brown pigment marker. Each tool produces a single-channel spatial map that is rasterized to a common $H \times W$ resolution ($224\times 224$ in our experiments) and stacked along the channel dimension:
\begin{itemize}
    \item \textbf{Lesion segmentation \texttt{(derm\_lesion\_segmenter)}.} First, we train our own lightweight U-Net--style lesion segmentation model using the pixel-wise lesion masks provided in the ISIC 2017 training set. The model takes the raw dermoscopic RGB image as input and outputs a high-resolution binary lesion mask, which we treat as a spatial tool map aligned with the original image. This mask is used both as a standalone tool channel and as a region-of-interest mask for the color-based marker tools described below.
    
    \item \textbf{Dermoscopic structure maps \texttt{(derm\_pigment\_network, derm\_negative\_network, derm\_streaks\_detector, derm\_milia\_like\_cyst\_detector)}.} 
    % Because robust off-the-shelf segmentation tools for dermoscopic structures such as \emph{pigment network}, \emph{negative network}, \emph{streaks}, and \emph{milia-like cysts} are not widely available, and the ISIC annotations for these structures are sparse superpixel-level labels, we do not train separate detectors for them. Instead, we use the ground-truth dermoscopic feature annotations provided by ISIC 2017 as hypothetical tool outputs—representing an idealized scenario in which accurate detectors for these clinically meaningful cues ``exist''.
    To obtain predicted dermoscopic feature maps, we integrate the open-source skinisic model~\cite{Kawahara2019}, a VGG16-based fully convolutional network finetuned on ISIC superpixel labels converted into per-pixel multi-channel segmentations, which predicts pixel-wise probability maps for pigment network, negative network, streaks, and milia-like cysts. We pass each ISIC 2017 image through skinisic to obtain four probability maps, which we threshold (0.5 by default) to obtain binary pixel maps that are resized to $H \times W$ and passed into \frameworkabbr{} as tool outputs.
    % Concretely, ISIC provides a color-coded superpixel map and a JSON file specifying, for each dermoscopic structure, which superpixel indices are positive. We decode the superpixel index for each pixel from the RGB superpixel image and, for each structure $s \in \{$pigment network, negative network, milia-like cyst, streaks$\}$, build a binary map that is $1$ on pixels whose superpixel is labeled as containing structure $s$ and $0$ elsewhere. These four binary maps are then resized to $H \times W$ and treated as if they were outputs of pretrained expert tools, allowing us to quantify an upper bound on the value of tool-based decomposition.
    \item \textbf{Color-based marker tools \texttt{(derm\_marker\_malignant\_union, derm\_marker\_browns)}.} Finally, we include two additional color-based tools inspired by prior work on color and texture features for dermoscopic lesion classification~\cite{marques2012role} and color-constancy-based preprocessing for robust skin image analysis~\cite{ciurea2003large}. Following these approaches, we first apply a simple ``shades of gray'' color constancy transform~\cite{ciurea2003large} to the RGB image to reduce illumination variability. In this normalized color space, we apply handcrafted threshold rules over the $(R,G,B)$ channels to detect canonical melanoma-related color patterns (e.g., very dark/black regions, blue-gray areas, white structures) and different shades of brown. Small connected components and holes below a minimum area threshold are removed via morphological post-processing, and all marker maps are restricted to the predicted lesion region by intersecting with the lesion segmentation mask. We then aggregate these fine-grained markers into two compact binary tools: (i) a \emph{malignant-colored pigment marker}, defined as the union of black, blue-gray, and white melanoma-associated colors, and (ii) a \emph{brown pigment marker}, defined as the union of light- and dark-brown regions.
    Both maps are rasterized as $H \times W$ binary images aligned with the dermoscopic image and appended as additional channels in the tool stack.
\end{itemize}
All seven ISIC tools are computed once per image and then kept fixed. The resulting maps are concatenated along the channel dimension into a tool tensor of shape $(C_{\text{tool}}, 224, 224)$ (with $C_{\text{tool}} = 7$ in our main experiments). As described in Section~\ref{sec:dataset_and_tools}, all ISIC tool channels are scaled to lie in $[0,1]$ before masking. Similar to Camelyon17, we represent dropped tools with  a constant map of $-1$s. 
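
The color-based marker pipeline can be sketched roughly as below. This is a minimal illustration under stated assumptions: the shades-of-gray transform is implemented as a per-channel Minkowski-norm illuminant estimate with $p = 6$ (a commonly used value, not necessarily ours), the single ``very dark'' threshold rule and its value are hypothetical stand-ins for the handcrafted rules, and the morphological post-processing step is omitted.

```python
import numpy as np


def shades_of_gray(image: np.ndarray, p: float = 6.0) -> np.ndarray:
    """Apply shades-of-gray color constancy to an RGB image with values in [0, 1].

    The illuminant of each channel is estimated with a Minkowski p-norm mean;
    each channel is then rescaled so that the three estimates become equal.
    """
    img = image.astype(np.float64)
    # Per-channel Minkowski mean: (mean(I_c ** p)) ** (1 / p).
    illuminant = np.power(np.mean(np.power(img, p), axis=(0, 1)), 1.0 / p)
    illuminant = np.maximum(illuminant, 1e-8)      # guard against division by zero
    gains = illuminant.mean() / illuminant         # equalize channel estimates
    return np.clip(img * gains, 0.0, 1.0)


def dark_marker(image: np.ndarray, lesion_mask: np.ndarray, thresh: float = 0.2) -> np.ndarray:
    """Hypothetical 'very dark' rule: all channels below `thresh`, inside the lesion."""
    dark = np.all(image < thresh, axis=-1)
    return (dark & lesion_mask.astype(bool)).astype(np.float32)


# Synthetic image with an artificial color cast (not real dermoscopy data).
rng = np.random.default_rng(0)
rgb = rng.uniform(0.0, 1.0, size=(224, 224, 3)) * np.array([1.0, 0.8, 0.6])
balanced = shades_of_gray(rgb)

# Toy square lesion mask standing in for the lesion segmenter output.
mask = np.zeros((224, 224), dtype=bool)
mask[64:160, 64:160] = True
marker = dark_marker(balanced, mask)
print(marker.shape, int(marker.sum()))
```

Intersecting with the lesion mask keeps the marker from firing on dark image borders or hair outside the lesion, which is the motivation for restricting all marker maps to the predicted lesion region.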
% \begin{figure}[htbp]
% \floatconts
%   {fig:isic_medgemma_tool_combos}% label for whole figure
%   {\caption{Distribution of MedGemma selected tool combinations (top 3 tools) across the ISIC 2017 Train and Test sets).}}% caption for whole figure
%   {%
%     \subfigure[Train MedGemma Tool Combinations]{%
%       \label{fig:isic_melano_train_tool_combos}%
%       \includegraphics[width=0.8\linewidth]{figs/isic_melano_train_tool_combos.png}%
%     }\qquad
%     \subfigure[Test MedGemma Tool Combinations]{%
%       \label{fig:isic_melano_test_tool_combos}%
%       \includegraphics[width=0.8\linewidth]{figs/isic_melano_test_tool_combos.png}%
%     }%
%   }
% \end{figure}

\section{Additional Results}
\label{sec:additional-results}
\subsection{Effect of Pretraining}
\label{app:effect_of_pretraining}
Table~\ref{tab:camelyon_tbm_unpretrained} summarizes TBF performance under scratch vs.\ ImageNet initialization when evaluated on Camelyon17. TBF achieves 86.7\% accuracy when trained from scratch, which increases to 92.3\% with ImageNet initialization. Even without pretraining, TBF already outperforms the non-pretrained EfficientNet and Y-Net baselines (72.6\% and 72.0\%), indicating that the tool bottleneck structure itself provides generalization benefits under limited supervision.

% Across both encoder types, ImageNet pretraining consistently improves TBM performance on the Camelyon17 dataset. When trained from scratch (Table~\ref{tab:camelyon_tbm_unpretrained}), the best separate-encoder TBM (VLM--Perturbed) reaches 87.7\%, while the best shared-encoder TBM achieves 86.7\%. With ImageNet initialization (Table~\ref{tab:camelyon_tbm_pretrained}), these values increase to 89.2\% (separate) and 90.8\% (shared), an accuracy gain of roughly 2-3 points. 


\begin{table}[t]
\floatconts
  {tab:camelyon_tbm_unpretrained}% label
  {\caption{Accuracy (\%) on Camelyon17 for \frameworkabbr{} variants and baselines with and without ImageNet pretrained weights. Best value per pretraining type is \textbf{bolded}.}}
  {\centering\input{tables/camelyon_tbm_unpretrained.tex}}
\end{table}

Among pretrained models, TBF both with and without perturbation exceeds the accuracy of the pretrained EfficientNet and Y-Net (88.6\% and 88.2\%). This suggests that TBF benefits not only from better representations, but also from decomposing the image into domain-grounded tool features, reducing reliance on spurious dataset-specific effects and improving robustness under distribution shift.


% Among pretrained models (Table~\ref{tab:camelyon_tbm_vs_baselines}), the best shared TBM (VLM--Perturbed, 90.8\%) and separate TBM (VLM--Perturbed, 89.2\%) both exceed the pretrained 
% EfficientNet black-box  and pretrained Y-Net .  Even when trained from scratch, TBMs reach 86--88\%, outperforming both black-box and Y-Net baselines by 10–15 points. This demonstrates that TBMs benefit not only from better representations, but also from decomposing the image into domain-grounded tool features—reducing reliance on spurious dataset-specific effects and improving robustness under distribution shift.

% \begin{table}[htbp]
% \floatconts
%   {tab:camelyon_tbm_pretrained}% label
%   {\caption{Camelyon17 (subset of 5,000 images): Test accuracy (\%) for TBM variants trained with ImageNet pretrained weights. Best value per encoder type is \textbf{bolded}.}}
%   {
%     \begin{tabular}{lcc}
%     \bfseries Model Variant & \bfseries Encoder Type &  \bfseries Test Acc. (\%) \\
%     \hline
%     TBM (All Tools, val on VLM-sel tools) & Separate & 72.2 \\
%     TBM–VLM Random & Separate & 88.5 \\
%     TBM–VLM Stochastic & Separate & \textbf{89.2} \\
%     TBM–VLM Exact & Separate & 89.0 \\
%     \hline
%     TBM (All Tools, val on VLM-sel tools) & Shared & 64.9 \\
%     TBM–VLM Random & Shared & 90.3 \\
%     TBM–VLM Stochastic & Shared & \textbf{90.8} \\
%     TBM–VLM Exact & Shared & 90.2 \\
%     \end{tabular}}
% \end{table}

% \begin{table}[htbp]
% \floatconts
%   {tab:camelyon_tbm_vs_baselines}% label
%   {\caption{Comparison of best-performing TBM variants against baseline models on Camelyon17 (subset of 5,000 images).}}
%   {\centering\input{tables/camelyon_tbm_vs_baselines.tex}}
% \end{table}

% \subsubsection{Availability of Tools ?}
% Table~\ref{tab:camelyon_upper_bound} compares the TBM--All Tools configuration, evaluated with full tool availability, against the best VLM-restricted TBM (VLM--Perturbed). 
% When all tools are always present, TBM--All Tools attains slightly higher test accuracy (e.g., 92.1\% vs.\ 90.8\% for the pretrained shared encoder, and 90.2\% vs.\ 89.2\% for the pretrained separate encoder), defining an upper bound on performance under idealized conditions where every expert tool is available and perfectly reliable at inference. 
% However, this setting does not reflect realistic clinical workflows, where tool availability is variable and some tools may be noisy, slow, or missing entirely. The VLM--Perturbed TBM trades at most 1–2 percentage points of accuracy for the ability to operate robustly under per-case tool subsets guided by MedGemma, providing a flexible bottleneck that better matches real-world constraints and supports instance-specific tool selection while preserving strong overall performance.

% \begin{table}[htbp]
% \floatconts
%   {tab:camelyon_upper_bound}% label
%   {\caption{Upper bound vs.\ VLM restricted evaluation on Camelyon17 (subset of 5,000 images). 
%     TBM--All Tools is evaluated with \emph{all} tools available (upper bound). 
%     Best TBM under restricted tool availability uses \emph{VLM--Perturbed} (per-image VLM-selected tools).}}
%   {
%     \begin{tabular}{lccc}
%     \toprule
%     \bfseries TBM variant & \bfseries Pretraining &  \bfseries TBM & \bfseries TBM--VLM Perturbed \\
%     \midrule
%     $\TBMl$ & No  & \textbf{89.3} & 87.7 \\
%     $\TBMl$ & Yes & \textbf{90.2} & 89.2 \\
%     $\TBMe$   & No  & \textbf{87.8} & 86.7 \\
%     $\TBMe$   & Yes & \textbf{92.1} & 90.8 \\
%     \bottomrule
%     \end{tabular}}
% \end{table}


\subsection{Tool Sampling Strategies}
\label{app:tbf_training_ablations}
% In addition to the main TBM architecture, we also explore a late fusion variant.
% % Each tool representation $\bm{z}_i$ is encoded via tool encoder(s) $g_\theta$ to produce latent feature embeddings $\bm{h}_i = g_i(\bm{z}_i)$.
% Instead of stacking all tool maps along the channel dimension, each tool map $\bm{z}_i$ is processed by its own feature extractor $g_{\theta_i}$ and outputs a scalar $g_{\theta_i}(\bm{z}_i) = h_i \in \mathbb{R}$.
% The embeddings are concatenated and passed into the classifier:
% $$
% \bm{y} = c_\phi([h_1, ..., h_N]).
% $$
% This formulation preserves modularity and allows inspection of the learned weights of each tool. %, supporting interpretability and intervention.
% However, this limits the expressibility of the TBM, as the model can only fuse information at a later stage.

% % Both architectures can be augmented with Tool Dropout and VLM-guided priors, described in the following sections, to enable flexible use of subsets of tools and improve robustness to missing or noisy tool outputs.

% Table~\ref{tab:late_fusion_tbm} shows the results, where $\TBMl{}$ denotes the late fusion variant of TBM.
% Generally, we observe that the model performs slightly worse than TBM with early fusion, likely due to the model being able to learn the optimal fusing of information within the shared network.
% $\TBMe$ achieves the highest overall accuracy, reaching 90.8\% test accuracy. $\TBMl$ attains 89.2\% accuracy, which is competitive but slightly lower. This gap reflects an architectural trade-off: $\TBMe$ jointly processes stacked tool maps, enabling early spatial fusion and cross-tool feature interactions (e.g., how nuclei count and nucleus type co-vary in malignant tissue). In contrast, $\TBMl$ processes each tool with its own backbone and concatenates scalar embeddings to fuse the tool features later, which makes per-tool contributions more interpretable but limits the ability to model fine-grained interactions between tools. This leads to a slight accuracy advantage for $\TBMe$ , while $\TBMl$ offers clearer modularity and easier attribution at the level of individual tools.

In addition to TBF, TBF without perturbation ($\alpha = 1$), and TBF with all modality-specific tools described in Table~\ref{tab:tbm_vs_baselines}, we explore other tool sampling strategies.  
We explore (1) \textit{Bernoulli}: sampling each tool independently from a Bernoulli distribution, (2) \textit{Random top-$k$}: randomly selecting a fixed number $k$ of tools for each image, and (3) $\alpha = c$: sweeping $\alpha$ over a range of values $c$ to control the strength of the VLM prior and tool perturbation, as described in Section~\ref{sec:tbm}. We set $k=3$ for all experiments. TBF ablation results are shown in Table~\ref{tab:tbf_ablations}.
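
The first two sampling strategies can be sketched as binary keep-masks over the toolbox. This is an illustrative sketch only: the tool names, the keep probability of $0.6$ (chosen here so that the expected number of kept tools matches $k = 3$ out of five), and the function names are assumptions rather than our exact configuration.

```python
import random

TOOLS = ["bbox", "centroid", "contour", "type", "type_prob"]  # illustrative names


def bernoulli_mask(tools, keep_prob, rng):
    """Keep each tool independently with probability `keep_prob`."""
    return [1 if rng.random() < keep_prob else 0 for _ in tools]


def random_top_k_mask(tools, k, rng):
    """Keep exactly `k` tools chosen uniformly at random without replacement."""
    kept = set(rng.sample(range(len(tools)), k))
    return [1 if i in kept else 0 for i in range(len(tools))]


rng = random.Random(0)
b_mask = bernoulli_mask(TOOLS, keep_prob=0.6, rng=rng)
k_mask = random_top_k_mask(TOOLS, k=3, rng=rng)
print(dict(zip(TOOLS, b_mask)))
print(dict(zip(TOOLS, k_mask)))
```

The key difference is that Bernoulli sampling lets the number of available tools vary from image to image, whereas random top-$k$ always exposes exactly $k$ tools.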

\begin{table}[htbp]
\floatconts
  {tab:tbf_ablations}% label
  {\caption{Performance of TBF ablations across Camelyon17 (Accuracy) and ISIC 2017 (AUC). }}
  {\centering\input{tables/tbf_ablations.tex}}
\end{table}

\subsection{VLM Prompting Strategies}
\label{app:vlm_prompting}
In addition to fixed top-$k$, we also experiment with dynamic top-$k$, which permits a dynamic number of tools per image but maintains $k$ tools per image on average over the training set.
Specifically, instead of prompting the VLM to output a fixed selection of $k$ tools per image, we prompt the VLM to score each tool in $[0, 1]$.
Then, we collect all tool scores for all images in the training set, and compute the cutoff score that corresponds to $k$ tools per image on average.
This cutoff is used for all dataset splits. The results are shown in Table~\ref{tab:tbf_ablations}.
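
The cutoff computation for dynamic top-$k$ can be sketched as follows, assuming the per-image tool scores have already been collected from the VLM. The scores below are made up, and the function names are hypothetical.

```python
def dynamic_top_k_cutoff(train_scores, k):
    """Return the score threshold that keeps k tools per image on average.

    `train_scores` is a list of per-image lists of tool scores in [0, 1].
    """
    flat = sorted((s for image in train_scores for s in image), reverse=True)
    budget = k * len(train_scores)          # total tools to keep over the split
    if not 0 < budget <= len(flat):
        raise ValueError("k must be between 1 and the number of tools")
    return flat[budget - 1]                 # lowest score that is still kept


def select_tools(scores, cutoff):
    """Indices of tools whose score reaches the cutoff (count varies per image)."""
    return [i for i, s in enumerate(scores) if s >= cutoff]


# Made-up scores for three training images and four tools.
train = [
    [0.9, 0.8, 0.7, 0.1],
    [0.6, 0.2, 0.1, 0.05],
    [0.95, 0.85, 0.3, 0.25],
]
cutoff = dynamic_top_k_cutoff(train, k=2)
selected = [select_tools(s, cutoff) for s in train]
print(cutoff, selected)
```

In this toy example the three images receive three, one, and two tools, respectively, for an average of $k = 2$. Tied scores at the cutoff can push the average slightly above $k$.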



\subsection{Tool Output Intervention}
\label{app:intervention}
{
\setlength{\belowcaptionskip}{-1pt}
\begin{figure}[t]
\floatconts
  {fig:nuc_dropout_visualization}
  {\vspace{-1.5\baselineskip}
  \caption{\textbf{(a)-(e)}: Visualization of the nuclei-dropout intervention on two example Camelyon17 contour maps. For each example patch, we mask out individual nuclei in the tool output maps independently with probability $p_{\text{mask}}$, which we sweep across $p_{\text{mask}} \in \{0.0, 0.2, 0.4, 0.6, 0.8\}$ (from left to right).
  \textbf{(f)}: As $p_{\text{mask}}$ increases, the fraction of images predicted as \textit{Normal} increases monotonically.}}
  {\includegraphics[width=0.8\linewidth]
  {figs/fig_4/camelyon_nuclei_dropout.pdf}}
\end{figure}
}
Besides the tool importance analysis discussed in Section~\ref{sec:interpretability}, another method for interrogating \frameworkabbr{}'s decision-making is to intervene on the tool outputs, since the features encoded in the tool outputs are clinically meaningful.
To test whether \frameworkabbr{} relies on nuclei-related features in a clinically meaningful way, we perform a nuclei–dropout intervention on Camelyon17.
For each patch, we drop individual nucleus instances independently with probability $p_{\text{mask}}$. When a nucleus is dropped, we set its corresponding tool maps (centroid, bounding-box fill, contour, type one-hot, and type probability) to the background value at the associated pixels (see Figure~\ref{fig:nuc_dropout_visualization}). 
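
A minimal sketch of this intervention is given below. It assumes a per-pixel instance map (a hypothetical input in which each nucleus has a unique positive integer ID) is available to identify the pixels of each nucleus, and it uses a single toy tool channel; the function name \texttt{drop\_nuclei} is illustrative.

```python
import numpy as np


def drop_nuclei(tool_maps, instance_map, p_mask, rng, background=0.0):
    """Erase each nucleus instance independently with probability `p_mask`.

    tool_maps:    float array of shape (C, H, W) holding the stacked tool outputs.
    instance_map: int array of shape (H, W); 0 is background, i > 0 is nucleus i.
    """
    out = tool_maps.copy()
    ids = np.unique(instance_map)
    ids = ids[ids > 0]
    dropped = ids[rng.random(ids.size) < p_mask]
    erase = np.isin(instance_map, dropped)        # pixels of all dropped nuclei
    out[:, erase] = background                    # applied to every tool channel
    return out, dropped


rng = np.random.default_rng(0)
instances = np.zeros((96, 96), dtype=np.int64)
instances[10:20, 10:20] = 1
instances[40:55, 30:45] = 2
instances[70:80, 60:75] = 3
maps = (instances > 0).astype(np.float32)[None]   # one toy tool channel

for p in (0.0, 0.2, 0.4, 0.6, 0.8):
    masked, dropped = drop_nuclei(maps, instances, p, rng)
    print(p, dropped.tolist(), int(masked.sum()))
```

Dropping whole instances, rather than random pixels, keeps the manipulated maps plausible as tool outputs: each remaining nucleus is intact, and only the nuclear density changes.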

We sweep $p_{\text{mask}} \in \{0.0, 0.2, 0.4, 0.6, 0.8\}$ over all nuclei in the validation set, and compute the fraction of patches predicted as \textit{Normal}, $\Pr(\hat{y}=\text{Normal})$. 
We observe that as $p_{\text{mask}}$ increases, $\Pr(\hat{y}=\text{Normal})$ increases. 
% Across training patches, Tumor patches tend to have higher density of nuclei than Normal patches.
% Dropping nuclei randomly by increasing $p_{\text{mask}}$ pushes the manipulated tool output feature maps to ``appear more \textit{Normal}'' in the tool space, leading to the model's benign probability prediction increasing.
% This directional shift (model predictions flipping from \textit{Tumor} to \textit{Normal}) is similar to what we would predict if the model learned a causually meaningful link between more nuclei and more malignant evidence. 
This behavior is consistent with established histopathology criteria, where higher nuclear density is a characteristic of malignant breast lesions and correlates with worse prognosis~\cite{narasimha2013significance}. Nuclear morphology studies report that increased nuclear area is associated with node-positive and higher-grade breast carcinomas~\cite{kuenen1984prognostic, pienta1991correlation}.

% Our intervention experiment confirm that TBMs are not only interpretable but also intervenable: their predictions respond in predictable, biologically consistent ways when specific tool-derived features are altered. This form of counterfactual analysis is not feasible in conventional black-box models, which rely on post-hoc interpretability/explanation methods. In contrast, TBM interpretability arises  directly from the model's structured design. Each tool represents a distinct, intervenable direction in which domain knowledge, provided by a clinician for instance, influences the prediction.  These experiments demonstrate that the model's reasoning process can be verified and manipulated in a principled way, enabling causal interpretability that is grounded in expert-defined features.

\subsection{Combinations of VLM Tool Selections}
\label{app:tool_combinations}
\begin{figure}[t]
\floatconts
  {fig:camelyon_medgemma_tool_distr}
  {\vspace{-1.5\baselineskip}
  \caption{Distribution of MedGemma-selected tool combinations for Camelyon17.}}
  {\includegraphics[width=0.5\linewidth]
{figs/fig_5/camelyon_tool_combo}}
\end{figure}
\begin{figure}[t]
\floatconts
  {fig:isic_medgemma_tool_combos}% label for whole figure
  {\caption{Distribution of MedGemma-selected tool combinations (top 3 tools) across the ISIC 2017 tasks.}}% caption for whole figure
  {%
    \subfigure[ISIC-BM]{%
      \label{fig:isic_bm_tool_distr}%
      \includegraphics[width=0.48\linewidth]{figs/fig_6/medgemma_malig_train_tool_combo.png}%
    }
    \subfigure[ISIC-MN]{%
      \label{fig:isic_mn_tool_distr}%
      \includegraphics[width=0.48\linewidth]{figs/fig_6/medgemma_melano_train_tool_combo.png}%
    }%
  }
\end{figure}
In addition to the normalized frequency of tool selections in training shown in Figure~\ref{fig:loto_freq_plots}, we show the distribution of unique combinations of MedGemma-selected tools (top 3) for Camelyon17 and ISIC in Figures~\ref{fig:camelyon_medgemma_tool_distr} and~\ref{fig:isic_medgemma_tool_combos}.
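A distribution over combinations of this kind can be computed by treating each image's selection as an unordered set. The sketch below is illustrative only; it assumes each selection is available as a list of tool names, and the function name \path{combo_distribution} is hypothetical.
\begin{lstlisting}
from collections import Counter

def combo_distribution(selections):
    """Normalized frequency of each unordered tool combination."""
    # Sort so that ["a", "b"] and ["b", "a"] count as one combination.
    counts = Counter(tuple(sorted(set(s))) for s in selections)
    total = sum(counts.values())
    return {combo: n / total for combo, n in counts.items()}
\end{lstlisting}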

\section{Details on Tool Knockout}
\label{app:tool_knockout}

We show that our tool knockout augmentation enables $f_\theta$ to estimate the full conditional (the distribution of $Y=y$ conditioned on all tool outputs) and all marginal conditionals (the distribution of $Y=y$ conditioned on any subset of tool outputs).
Our argument follows the theoretical analysis of Knockout by Nguyen et al.\ \cite{nguyen2025knockoutsimplewayhandle}.

Let $\bm{z}_i = t_i(\bm{x})$ denote the tool output for the $i$-th tool.
Let $\bm{Z} = (\bm{z}_1,\dots,\bm{z}_N)$ denote the collection of all $N$ tool outputs.
Let $\mathcal{M} \subseteq \{1,\dots,N\}$ denote the index set of knocked-out tools.
Let $\bm{M}$ denote the corresponding binary mask vector $\bm{M}=(M_1,\dots,M_N)\in\{0,1\}^N$, where $M_i=1$ if and only if $i \in \mathcal{M}$.
Note that $\bm{M}$ is sampled independently of $(\bm{Z},Y)$.
$\bm{Z}_{\mathcal{M}}$ denotes the elements of $\bm{Z}$ indexed by $\mathcal{M}$ (the knocked-out tools), and $\bm{Z}_{-\mathcal{M}}$ denotes the remaining, non-knocked-out elements.
We denote the full conditional as $p(Y \mid \bm{Z})$ and all marginal conditionals as $p(Y \mid \bm{Z}_{-\mathcal{M}})$. 

During training, we construct knockout-augmented inputs
\[
\bm{Z}'(\bm{M},\bm{Z}) = \bm{M} \odot \bar{\bm{z}} + (\bm{1}-\bm{M}) \odot \bm{Z},
\]
where $\bar{\bm{z}}$ is a placeholder feature map chosen to lie outside the support (or in a negligible-density region) of $\bm{Z}$ and $\odot$ is element-wise multiplication.\footnote{As mentioned in the main paper, this is easily implemented by replacing the tool output $\bm{z}_i$ with $\bar{\bm{z}}_i$ of the same shape.}
This ensures the equivalence
\[
\bm{Z}'_{\mathcal{M}} = \bar{\bm{z}}_{\mathcal{M}} \;\Longleftrightarrow\; \bm{M}_{\mathcal{M}} = \bm{1}, 
\qquad\quad
\bm{Z}'_{\mathcal{M}} \neq \bar{\bm{z}}_{\mathcal{M}} \;\Longleftrightarrow\; \bm{M}_{\mathcal{M}} = \bm{0} \text{ and } \bm{Z}'_{\mathcal{M}}=\bm{Z}_\mathcal{M},
\]
where $\bm{0}$ and $\bm{1}$ are vectors of zeros and ones of appropriate shape.
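The construction of $\bm{Z}'$ can be sketched in code. The snippet is a minimal illustration rather than the training implementation: it assumes an i.i.d.\ Bernoulli mask with rate \path{p_knock} (the argument only requires $\bm{M}$ to be independent of $(\bm{Z},Y)$), and the function name \path{knockout} and the placeholder values are hypothetical.
\begin{lstlisting}
import numpy as np

def knockout(tool_outputs, placeholders, p_knock, rng):
    """Return (Z', M): tool i is replaced by its placeholder iff M_i = 1.

    tool_outputs, placeholders: lists of N arrays, matching shapes.
    p_knock: per-tool knockout probability (assumed Bernoulli mask).
    """
    n_tools = len(tool_outputs)
    mask = (rng.random(n_tools) < p_knock).astype(int)
    augmented = [
        m * zbar + (1 - m) * z   # Z'_i = M_i * zbar_i + (1 - M_i) * z_i
        for m, z, zbar in zip(mask, tool_outputs, placeholders)
    ]
    return augmented, mask
\end{lstlisting}
For tool maps with values in $[0,1]$, a constant map of $-1$ would be one placeholder that lies outside the support; this particular choice is an assumption of the sketch.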
% so missingness is perfectly encoded by the observed values themselves. 
% The TBM prediction is $f_\theta(\bm{z}'(\bm{M},\bm{z}))$.

Because $\bm{M}$ is independent of $(\bm{Z},Y)$:
\begin{align}
p(Y \mid \bm{Z}'_{\mathcal{M}}=\bar{\bm{z}}_{\mathcal{M}},\, \bm{Z}'_{-\mathcal{M}}=\bm{z}_{-\mathcal{M}})
&= p(Y \mid \bm{M}_{\mathcal{M}}=\bm{1}, \bm{M}_{-\mathcal{M}}=\bm{0}, \; \bm{Z}_{-\mathcal{M}} = \bm{z}_{-\mathcal{M}}) \\
&= p(Y \mid \bm{Z}_{-\mathcal{M}} = \bm{z}_{-\mathcal{M}}).
\end{align}
Thus, knockout augmentation exactly corresponds to marginalization of the missing tool outputs. 
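For instance, with $N=2$ tools and the first tool knocked out ($\mathcal{M}=\{1\}$), the two steps read
\[
p(Y \mid \bm{Z}'_1=\bar{\bm{z}}_1,\, \bm{Z}'_2=\bm{z}_2)
= p(Y \mid M_1=1,\, M_2=0,\, \bm{Z}_2=\bm{z}_2)
= p(Y \mid \bm{Z}_2=\bm{z}_2),
\]
where the first equality holds because observing the placeholder $\bar{\bm{z}}_1$ reveals $M_1=1$ while observing a non-placeholder value reveals $M_2=0$ and $\bm{Z}_2=\bm{z}_2$, and the second holds because $\bm{M}$ is independent of $(\bm{Z},Y)$. This two-tool case is only an illustration of the general identity.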
% The TBM prediction under any missing-tool pattern is
% \[
% f_\theta(\bar{\bm{z}}_{\mathcal{M}}, \bm{Z}_{-\mathcal{M}}) \approx \arg\max_Y p(Y \mid \bm{Z}_{-\mathcal{M}}).
% \]

To see that $f_\theta$ learns all such marginals simultaneously, consider the expected training loss under tool knockout:
\begin{align}
\mathcal{L}(\theta)
&= \mathbb{E}_{\bm{Z},Y}\, \mathbb{E}_{\bm{M}}\,
\ell\!\left(Y,\, f_\theta(\bm{Z}'(\bm{M},\bm{Z}))\right) \\
&= \mathbb{E}_{\bm{Z},Y}\, \mathbb{E}_{\bm{M}} \sum_{\bm{m} \in \{0,1\}^N} \mathbb{I}(\bm{M}=\bm{m})\;
\ell\!\left(Y,\, f_\theta(\bm{Z}'(\bm{m},\bm{Z}))\right) \\
&= \mathbb{E}_{\bm{Z},Y}\,  \sum_{\bm{m} \in \{0,1\}^N} p(\bm{M}=\bm{m})
\;
\ell\!\left(Y,\, f_\theta(\bm{Z}'(\bm{m},\bm{Z}))\right) \\
&= \sum_{\bm{m} \in \{0,1\}^N} p(\bm{M}=\bm{m})
\;\mathbb{E}_{\bm{Z},Y}\,
\ell\!\left(Y,\, f_\theta(\bm{Z}'(\bm{m},\bm{Z}))\right),
\end{align}
where $\mathbb{I}$ is the indicator function.
That is, the objective is a weighted sum of losses, one per mask pattern $\bm{m}$ in the support of $p(\bm{M})$, each of which trains $f_\theta$ to estimate the corresponding conditional $p(Y \mid \bm{Z}_{-\mathcal{M}})$.
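As an illustration, suppose each tool is knocked out independently with probability $q \in (0,1)$; this i.i.d.\ Bernoulli mask is an assumption made here for concreteness, since the argument only requires $\bm{M}$ to be independent of $(\bm{Z},Y)$. Then
\[
p(\bm{M}=\bm{m}) = q^{\|\bm{m}\|_1}\,(1-q)^{N-\|\bm{m}\|_1},
\]
so for $N=2$ the objective is a sum of four terms with weights $(1-q)^2$, $q(1-q)$, $q(1-q)$, and $q^2$ for $\bm{m}=(0,0)$, $(1,0)$, $(0,1)$, and $(1,1)$, respectively. Every mask pattern receives positive weight, so the full conditional ($\bm{m}=\bm{0}$) and every marginal conditional contribute to training.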
% In particular,
% \begin{itemize}
% \item If $\bm{m}=\mathbf{0}$ (no knockout), then $f_\theta(\bm{z})$ learns $p(y \mid \bm{z}_1,\dots,\bm{z}_N)$.
% \item If $\bm{m}$ has a single $1$, then $f_\theta(\bar{\bm{z}}_i, \bm{z}_{-i})$ learns $p(y \mid \bm{z}_{-i})$.
% \item For general $\bm{m}$, $f_\theta(\bar{\bm{z}}_{\bm{M}}, \bm{z}_{-\bm{M}})$ learns $p(y \mid \bm{z}_{-\bm{M}})$.
% \end{itemize}

\section{Prompts}
\label{app:prompts}

We use MedGemma as a vision–language tool selector. For each image, modality, and task, we provide the model with
(1) the toolbox $\mathcal{T} = \{ t_1, \ldots, t_N \}$,
(2) a brief description of the modality and each tool, and
(3) a task-specific natural language instruction.

The model is asked to return a JSON object specifying which tools from $\mathcal{T}$ should be used for that image and task.

For Camelyon17 (histopathology), the toolbox consists of nuclei-level tools that identify nuclei centroids, bounding boxes, contours, types, and type probabilities:
 \path{histo_nuc_centroid}, \path{histo_nuc_bbox}, \path{histo_nuc_contour}, \path{histo_nuc_type}, and \path{histo_nuc_type_prob}.
 
For ISIC 2017 (dermatology), the toolbox is constructed from a lesion segmentation tool, dermoscopic structure maps, and color-based markers:
 \path{derm_lesion_segmenter}, \path{derm_pigment_network}, \path{derm_negative_network}, \path{derm_milia_like_cyst_detector}, \path{derm_streaks_detector}, \path{derm_marker_malignant_union}, and \path{derm_marker_browns}.

We provide the VLM with a toolbox consisting of the union of all tools across modalities.
\begin{lstlisting}
TOOLBOX = {
    "histo_nuc_centroid",
    "histo_nuc_bbox",
    "histo_nuc_contour",
    "histo_nuc_type",
    "histo_nuc_type_prob",
    "derm_lesion_segmenter",
    "derm_pigment_network",
    "derm_negative_network",
    "derm_milia_like_cyst_detector",
    "derm_streaks_detector",
    "derm_marker_malignant_union",
    "derm_marker_browns"
}
\end{lstlisting}

For both fixed and dynamic tool selection, we also provide MedGemma with a brief natural language description of each tool:
\begin{lstlisting}
TOOL_DESCRIPTIONS = {
  "histo_nuc_centroid":  "Returns each nucleus centroid in 
                         [x_center, y_center]",
  "histo_nuc_bbox":      "Returns each nucleus bounding box in 
                         [x_top_left, y_top_left, width, height]",
  "histo_nuc_contour":   "Returns the polygon points
                         tracing each nucleus boundary",
  "histo_nuc_type":      "Returns the predicted nucleus type label 
                         (0-5; e.g., epithelial, inflammatory, 
                         connective, dead, non-neoplastic 
                         epithelial)",
  "histo_nuc_type_prob": "Returns class-probability scores for the 
                          predicted nucleus type",

  "derm_lesion_segmenter":         "Segments lesion ROI",
  "derm_pigment_network":          "Detects reticular pigment 
                                    network",
  "derm_negative_network":         "Detects negative network (white 
                                    lines)",
  "derm_streaks_detector":         "Detects radial streaks or 
                                    pseudopods at edges",
  "derm_milia_like_cyst_detector": "Detects milia-like cysts (often 
                                    SK)",
  "derm_marker_malignant_union":   "Union of malignancy chromatic
                                    markers",
  "derm_marker_browns":            "Detects brown pigment regions"
}    
\end{lstlisting}

The base prompt template for MedGemma in the fixed tool selection setting is: 
\begin{lstlisting}
You are a medical expert in {modality}. Select tools for a single   
task from a fixed toolbox {TOOLBOX} described by {TOOL_DESCRIPTIONS}. 
Choices must depend on the task and image evidence.

Choose max {max_tools} tools from the toolbox {TOOLBOX}
that are most relevant for solving the task in each image; no 
duplicates.

Return ONLY JSON with the following fields:
- task_modality
- task
- selected_tools
- abstain  (boolean)

Each entry in selected_tools must include:
- id           (tool name from TOOLBOX)
- rank         (1..N, 1 = most important)
- confidence   (float in [0, 1])
- reason       (brief phrase tied to image cues)

If you are unsure about the modality or task, set task_modality = 
"unknown" and abstain = true.

Return ONLY JSON.
\end{lstlisting}
We then append a task-specific instruction depending on the modality and classification problem. 
For the Camelyon17, ISIC-BM, and ISIC-MN tasks, we use the following prompt instructions, respectively:
\begin{lstlisting}
Task: Determine if the central 32x32 region of this 96x96 
histopathology patch contains tumor or not. 
Answer ONLY in the following format:
label: tumor or no tumor
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}

\begin{lstlisting}
Task: Determine if the lesion in the dermoscopic image
is malignant or benign.
Answer ONLY in the following format:
label: malignant or benign
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}

\begin{lstlisting}
Task: Determine if the lesion in the dermoscopic image
is melanocytic or non-melanocytic.
Answer ONLY in the following format:
label: melanocytic or non-melanocytic
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}
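A reply to the fixed-selection prompt can be checked against the requested JSON fields before it is used. The sketch below is illustrative only: the function name \path{parse_selection} and the decision to raise an error on malformed replies are assumptions, not a description of the pipeline used in the paper.
\begin{lstlisting}
import json

REQUIRED = {"task_modality", "task", "selected_tools", "abstain"}
ENTRY_KEYS = {"id", "rank", "confidence", "reason"}

def parse_selection(reply, toolbox, max_tools):
    """Parse a JSON reply; return ranked tool ids, or [] on abstain."""
    obj = json.loads(reply)
    missing = REQUIRED - set(obj)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if obj["abstain"]:
        return []
    tools = obj["selected_tools"]
    for t in tools:
        # Each entry needs all four keys and a known tool id.
        if not ENTRY_KEYS <= set(t) or t["id"] not in toolbox:
            raise ValueError(f"bad entry: {t}")
        if not 0.0 <= t["confidence"] <= 1.0:
            raise ValueError(f"confidence out of range: {t}")
    ids = [t["id"] for t in tools]
    if len(ids) > max_tools or len(set(ids)) != len(ids):
        raise ValueError("too many tools or duplicate ids")
    # Rank 1 is the most important tool.
    return [t["id"] for t in sorted(tools, key=lambda t: t["rank"])]
\end{lstlisting}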

\paragraph{Dynamic tool selection.} For dynamic tool selection, we instead prompt MedGemma to score every tool in the toolbox for the given modality. Details of the dynamic selection procedure are described in Appendix~\ref{app:vlm_prompting}. The scoring prompt replaces the selection instructions above with:
\begin{lstlisting}
You are a medical expert in {modality}. Score each tool in the 
provided toolbox {TOOLBOX} described by {TOOL_DESCRIPTIONS} between  
[0,1]. Base scores strictly on the visible image evidence and task relevance.
Return JSON ONLY with keys: task_modality, task, scores.
- scores must be a list of objects with: id (string), score (integer 
0...1).
- Provide exactly one score for each tool id you are given.
Scores do not need to be rounded numbers. No omission of scores.
\end{lstlisting}

The same toolbox, tool descriptions, and task-specific instructions are used as in the fixed-selection setting. The prompt templates above are the ones referenced in Section~\ref{sec:vlm} and Appendix~\ref{app:vlm_prompting}.
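A reply to the scoring prompt can be validated in a similar way. The sketch below checks that exactly one score in $[0,1]$ is returned per tool; the function names are hypothetical, and the top-$k$ rule in \path{top_k} is only an assumed example of how scores might be turned into a selection, not the procedure of Appendix~\ref{app:vlm_prompting}.
\begin{lstlisting}
import json

def parse_scores(reply, tool_ids):
    """Return {tool id: score}; require exactly one score per tool."""
    obj = json.loads(reply)
    scores = {}
    for item in obj["scores"]:
        tid, s = item["id"], float(item["score"])
        if tid in scores or tid not in tool_ids:
            raise ValueError(f"duplicate or unknown id: {tid}")
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of [0, 1]: {s}")
        scores[tid] = s
    if set(scores) != set(tool_ids):
        raise ValueError("missing scores")
    return scores

def top_k(scores, k):
    """Hypothetical selection rule: the k highest-scoring tools."""
    return sorted(scores, key=scores.get, reverse=True)[:k]
\end{lstlisting}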

\subsection{LLaVA-Med Prompts}
\label{app:llavamed}

For our LLaVA-Med zero-shot baseline, we use the following prompt instructions for the Camelyon17, ISIC-BM, and ISIC-MN tasks, respectively:
\begin{lstlisting}
Instruction:
You are a pathology assistant. Look at the central 32x32 region of 
this 96x96 histopathology patch and answer strictly: 
Is there tumor present in the central region or not? 
Answer ONLY in the following format:
label: tumor or no_tumor
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}

\begin{lstlisting}
Instruction:
You are a dermatology expert. Look at this dermoscopic image of a 
skin lesion.
Is the lesion benign or malignant?
Answer ONLY in the following format:
label: malignant or benign
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}

\begin{lstlisting}
Instruction:
You are a dermatology expert. Look at this dermoscopic image of a 
skin lesion.
Is the lesion melanocytic or not?
Answer ONLY in the following format:
label: melanocytic or non-melanocytic
prob: <a real-valued prediction probability in [0,1] that the 
predicted label is correct>
\end{lstlisting}
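Replies in the requested \path{label}/\path{prob} format can be parsed with a short regular expression. This is a minimal sketch under assumptions: the function name \path{parse_label_prob} is hypothetical, and returning \path{None} for replies that do not follow the format is one possible policy, not necessarily the one used for the reported baseline.
\begin{lstlisting}
import re

PATTERN = re.compile(
    r"label:\s*(?P<label>.+?)\s*\n\s*prob:\s*(?P<prob>[0-9]*\.?[0-9]+)",
    re.IGNORECASE,
)

def parse_label_prob(reply):
    """Extract (label, prob) from a 'label: ... / prob: ...' reply."""
    match = PATTERN.search(reply)
    if match is None:
        return None                  # reply did not follow the format
    prob = float(match.group("prob"))
    if not 0.0 <= prob <= 1.0:
        return None                  # probability outside [0, 1]
    return match.group("label").strip().lower(), prob
\end{lstlisting}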

\section{Computational Cost}

% \textcolor{red}{ 
% We report predictor-only cost and end-to-end pipeline latency. Predictor-only cost measures a single forward pass of the final classifier (EfficientNet on RGB; TBM on tool maps) and is reported via parameters, FLOPs/image, and GPU latency. We report end-to-end latency  as well because FLOPs estimates for VLM tool selection and tool operations are not consistently defined, whereas token counts and measured runtime provide a better measure of deployment cost. For ISIC, the TBM’s encoder is a 4-block CNN trained from scratch, whereas EfficientNet-B0 is initialized with ImageNet pretraining; we note this difference when interpreting results.
% For the VLM, we use MedGemma-4B

% Note that in Table~\ref{tab:computation}, the FLOPs/image component for our VLM (MedGemma) excludes the vision encoder 
% }

\begin{table}[htbp]
\floatconts
    {tab:computation}
    {\caption{Parameter count, FLOPs per image, and inference time per image for a single forward pass.}}
    {\centering\input{tables/computation}}
\end{table}
