\section{Results}
\label{sec:results}

\subsection{Main Results}
\label{sec:main-results}
We compare the performance of \frameworkabbr{} and its variants against all baselines on Camelyon17, ISIC-BM, and ISIC-MN in Table~\ref{tab:tbm_vs_baselines}.

% We analyze the behavior of $\TBMe$ variants across encoder architectures and VLM-guided dropout strategies. 
% Table~\ref{tab:tbm_vs_baselines} compares these TBM variants against EfficientNet, Y-Net and MedGemma baselines. 
% We found that pretraining the tool encoder on ImageNet weights led to the best performing $\TBMe$ and $\TBMl$ variants. 

\begin{table}[t]
\floatconts
  {tab:tbm_vs_baselines}% label
  {\caption{\frameworkabbr{} performance against baselines and ablations on Camelyon17 (Accuracy), ISIC-BM (AUC), and ISIC-MN (AUC). 
  Top group: zero-shot baselines with and without tools.
  Middle group: trained/fine-tuned baselines.
  Bottom group: our proposed framework and ablations.
  % $\TBMe$ and $\TBMl$ represent TBM with early and late fusion, respectively. 
  \textbf{Bold} marks the best result and \underline{underline} the second best.
  For ablations, the delta is computed with respect to \frameworkabbr{} (ours). 
  }}
  {\centering\input{tables/tbm_vs_baselines.tex}}
\end{table}


% \begin{table}[t]
% \floatconts
%   {tab:tbm_ablations}% label
%   {\caption{Ablations}}
%   {\centering\input{tables/tbm_ablations.tex}}
% \end{table}

% On Camelyon17, we evaluate TBM and baseline methods on the dataset's Tumor vs.\ Normal task. 
% The tool encoder in the TBM variants for this task use EfficientNet backbones pretrained on ImageNet weights. 
Generally, we observe that \frameworkabbr{} outperforms or is on par with the baselines across all tasks we tested. 
% This is significant because TBM uses more domain-specific tools and is more interpretable, yet we don't sacrifice in performance.
\frameworkabbr{} is the best-performing model on Camelyon17 and ISIC-MN and is second-best for ISIC-BM, where EfficientNet performs the best.
Notably, \frameworkabbr{} outperforms Y-Net, despite both being trained on the same amount and types of image-level and pixel-level data.
We attribute this to the \frameworkabbr{} formulation: since $f_\theta$ takes as input only VLM-selected, clinically relevant features, we hypothesize that learning predictors on these features (instead of raw images) yields more robust predictions. 
Additionally, MedGemma, MedGemma w/ Tool Prompts, and VisProg all perform poorly (VisProg gives constant answers, resulting in an AUC of 0.5).
Since these models can use the same toolbox $\mathcal{T}$ but compose its tools through zero-shot, text-based mechanisms, this suggests that tool-use frameworks with learned composition mechanisms are important for achieving the best performance in medical imaging settings.
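For reference, the AUC of 0.5 for a constant predictor follows from the pairwise form of AUC. Writing $s(\bm{x})$ for a model's score and $\mathcal{D}^{+}$, $\mathcal{D}^{-}$ for the positive and negative test examples (notation introduced here only for illustration), and counting ties as one half,
\[
    \mathrm{AUC} = \frac{1}{|\mathcal{D}^{+}|\,|\mathcal{D}^{-}|} \sum_{\bm{x}^{+} \in \mathcal{D}^{+}} \sum_{\bm{x}^{-} \in \mathcal{D}^{-}} \left( \mathbf{1}\left[s(\bm{x}^{+}) > s(\bm{x}^{-})\right] + \frac{1}{2}\,\mathbf{1}\left[s(\bm{x}^{+}) = s(\bm{x}^{-})\right] \right).
\]
If $s$ is constant, every pair is a tie, so each term equals $\frac{1}{2}$ and $\mathrm{AUC} = 0.5$ regardless of the class balance.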
% , indicating that tool-based decomposition and grounding the representation in dermoscopic structures yields large performance gains over both purely segmentation-driven or zero-shot multimodal classification approaches. 
% This surpasses all baselines (EfficientNet, Y-Net, and MedGemma), while additionally providing explicit, tool-level interpretability.

% For the malignant vs.\ benign task, the pretrained black-box EfficientNet achieves the highest AUC (0.784), with the $\TBMe$--VLM Perturbed variant close behind at 0.770, and $\TBMe$ at 0.748. All TBM variants substantially outperform Y-Net (0.658) and the MedGemma VLM-only baseline (0.488), indicating that tool-based decomposition and explicitly grounding the representation of the visual input in domain-specific tool outputs significantly improves performance. 
% For the melanocytic vs.\ non-melanocytic task, TBM provides the strongest gains: $\TBMe$--VLM Perturbed attains the best overall AUC of 0.918, slightly surpassing both the black-box baseline (0.912), $\TBMe$ (0.911), and $\TBMe$--VLM (0.904). All $\TBMe$ variants outperform Y-Net (0.866) and far exceeding the MedGemma zero-shot baseline (0.499).
% This demonstrates that, even without any pretraining, TBM can match or exceed the performance of a pretrained black-box classifier when the available tools (lesion mask, dermoscopic structures, color markers) align closely with the underlying clinical decision.
% Across both tasks, the $\TBMl$ variants perform slightly worse than $\TBMe$ and the EfficientNet baseline, but still match or exceed the Y-Net and MedGemma baselines. 


% On ISIC, we evaluate Test AUC on two clinically relevant tasks: malignant vs.\ benign and melanocytic vs.\ non-melanocytic. 
% Notably, all ISIC TBM models are trained from scratch, whereas the EfficientNet baseline benefits from ImageNet pretraining, yet TBM still performs similarly to or exceeds its performance while providing explicit interpretability. 

The bottom of Table~\ref{tab:tbm_vs_baselines} depicts two ablation experiments.
First, we ablate the perturbation of VLM tool selections described in Section 3.3; this corresponds to setting $\alpha=1$.
Across all tasks, we observe a 1--2\% drop in performance without random perturbation of the VLM-selected tools.
We hypothesize that perturbing the VLM tool selection makes the TBM more robust, since the model sees a wider variety of tool combinations during training than it does without random perturbation or without tool selection. 
Second, we ablate the VLM tool selector and simply pass in all modality-specific tools. Note that in the case of large $N$, this is computationally intractable.
We observe that the TBM performs slightly worse when using all modality-specific tools than when using VLM tool selection.
% In principle, given infinite data and tools, increasing tools will outperform any subset.
% \joy{Let's describe this more precisely; we are using ``robustness'' too many times and too vaguely. }
% We attribute this to distribution shift or limited data. 
% Interestingly, this occurs despite the MedGemma Zero-shot baseline performing poorly (29.5\%). 
% When tasked to classify tumor vs.\ normal directly from the image and task description, MedGemma lacks the structured, pixel-level supervision and task-specific adaptation that TBM receives.  
% We take this as an indication of the advantage of using VLMs as tool selection instead of the direct predictor. % of explicit tool-based decomposition for fine-grained visual diagnosis and suggest that VLMs are more effective as tool selectors than as standalone classifiers.
% Furthermore, we observe that TBM-VLM Perturbed generally outperforms TBM-VLM. 
% This means that there is an advantage to making the TBM more robust to different tools during training, e.g. when there is distribution shift.

% Across early and late fusion TBM variants, the VLM--Perturbed variant consistently achieves 
% the best performance (89.2\% / 90.8\%). 
% By sampling masks that keep MedGemma-selected tools with high probability and others with low 
% probability, this variant exposes the model to a wide range of tool combinations, including ``OOD'' subsets representing cases where tools may be missing, uncertain, or noisy.  
% This improves robustness to variable tool availability at inference time. 

% Overall, $\TBMe$ matches or exceeds baseline performance while grounding its predictions in interpretable, tool-level features, and it outperforms both Y-Net and MedGemma on the ISIC 2017 dataset.

We refer the reader to Appendix~\ref{app:effect_of_pretraining} for results using non-pretrained models. 


\subsection{Data Efficiency}
\label{sec:data-efficiency}
We hypothesize that \frameworkabbr{} has advantages in low-data regimes, since its architectural design encodes clinically relevant inductive biases.
% \frameworkabbr{} leverages more clinically-relevant priors knowledge in its architectural design.
To assess this, we conduct data-efficiency experiments on both Camelyon17 and ISIC-MN (Figure~\ref{fig:data_efficiency}). For each dataset, we train $\TBMe$ and EfficientNet on increasingly large subsets of the training set. % and report performance averaged over multiple random seeds, with 95\% confidence intervals. 
The subsets are chosen at random while ensuring that classes are balanced.

Across all training-set sizes, \frameworkabbr{} outperforms EfficientNet, with especially large gains in the small-data regime (4--64 images).
For example, with only four labeled Camelyon17 images, \frameworkabbr{} reaches $\sim$0.64 accuracy, while the black-box baseline reaches only $\sim$0.57. 
% As the training set size increases, both methods improve, but TBM maintains a consistent lead in performance and converges to a higher final accuracy. 
At larger training sizes, we also observe that \frameworkabbr{} exhibits lower variance across seeds, which we attribute to the reduced hypothesis space of $f_\theta$ due to its clinical grounding.
%tool-based supervision stabilizes optimization and mitigates overfitting when data is limited. 

% On the ISIC dataset, we perform this data efficiency experiment on the ISIC-MN task. 
% A similar trend holds, as at nearly all subset sizes, particularly in smaller training subset sizes, TBM AUC matches or exceeds the EfficientNet baseline, despite being trained from scratch while the EfficientNet baseline benefits from ImageNet pretraining. In lower data training settings, TBM achieves up to several points higher Test AUC. As dataset size increases, performance between the two models becomes closer, but TBM remains competitive or better. This demonstrates that structured tool outputs (lesion masks, dermoscopic features, and color marker-maps) provide strong, clinically aligned inductive bias. 

% Overall, these results show that TBMs are substantially data-efficient than black-box EfficientNet models. %, offering better performance and more stable training under low-data constraints.
% We show with the ISIC dataset that TBMs are able to achieve this without relying on large-scale pretraining, highlighting the value of using expert-level tool outputs as domain-informed priors.   

\begin{figure}[t]
\floatconts
  {fig:data_efficiency}% label for whole figure
  {\caption{Model performance of \frameworkabbr{} vs.\ the EfficientNet baseline across training set sizes (log scale). Mean $\pm$ 95\% CI over seeds. 
  \frameworkabbr{} exhibits improved performance across all training set sizes.
  }}% caption for whole figure
  {%
    \subfigure[Camelyon17]{%
      \label{fig:camelyon_data_efficiency}%
      \includegraphics[width=0.40\linewidth]{figs/fig_2/camelyon_data_efficiency.pdf}%
    }
    \subfigure[ISIC-MN]{%
      \label{fig:isic_data_efficiency}%
      \includegraphics[width=0.40\linewidth]{figs/fig_2/isic_data_efficiency.pdf}%
    }%
  }
\end{figure}


\subsection{Analysis}
\label{sec:interpretability}
We are interested in analyzing the ``importance'' of each tool for a given task and how that relates to the distribution of VLM tool selections during training.
To measure the importance of a given tool, we knock out that tool while passing all other modality-specific tools, and compute the resulting change in performance averaged across the validation set. 
Specifically, the importance of tool $t_i$ is defined as:
\begin{equation}
\label{eq:tool_importance}
    \mathcal{I}(t_i) = \frac{1}{|\mathcal{D}_v|}\sum_{(\bm{x}, \bm{y}) \in \mathcal{D}_v} \left[ m\left(\text{TBM}(\bm{x}, \mathcal{T}), \bm{y}\right) - m\left(\text{TBM}(\bm{x}, \mathcal{T}_{-\{t_i\}}), \bm{y}\right) \right],
\end{equation}
where $\mathcal{D}_v$ is the validation set, $m$ is the performance metric, and $\mathcal{T}_{-\{t_i\}}$ denotes the toolbox with $t_i$ dropped.
This is closely related to the notion of influence functions~\cite{koh2017understanding}.
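As a worked special case of Eq.~\ref{eq:tool_importance}, suppose $m$ is per-example correctness, $m(\hat{\bm{y}}, \bm{y}) = \mathbf{1}[\hat{\bm{y}} = \bm{y}]$ (notation introduced here for illustration; it assumes the TBM output is compared to the label as a hard prediction). The sum then splits into two averages, and the importance reduces to a difference of validation accuracies:
\[
    \mathcal{I}(t_i) = \underbrace{\frac{1}{|\mathcal{D}_v|}\sum_{(\bm{x}, \bm{y}) \in \mathcal{D}_v} \mathbf{1}\left[\text{TBM}(\bm{x}, \mathcal{T}) = \bm{y}\right]}_{\mathrm{Acc}(\mathcal{T})} \;-\; \underbrace{\frac{1}{|\mathcal{D}_v|}\sum_{(\bm{x}, \bm{y}) \in \mathcal{D}_v} \mathbf{1}\left[\text{TBM}(\bm{x}, \mathcal{T}_{-\{t_i\}}) = \bm{y}\right]}_{\mathrm{Acc}(\mathcal{T}_{-\{t_i\}})}.
\]
Under this reading, $\mathcal{I}(t_i) > 0$ means that removing $t_i$ lowers accuracy, while $\mathcal{I}(t_i) \approx 0$ means that the remaining tools compensate for its absence. For a set-level metric such as AUC, which does not decompose over examples, the analogous quantity would be the difference between the metric computed on all of $\mathcal{D}_v$ with and without $t_i$.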

We perform this analysis on the Camelyon17 and ISIC-MN tasks (Figure~\ref{fig:loto_freq_plots}), where $m$ is accuracy and AUC, respectively. 
Alongside tool importance, we also plot the normalized frequency of tool selections across the training set, since tool importance may be correlated with how frequently that tool is selected by the VLM during training.
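One way to make the normalized frequency concrete, stated as an assumption since the exact normalization is not spelled out here: let $S(\bm{x}) \subseteq \mathcal{T}$ denote the set of tools the VLM selects for image $\bm{x}$, and let $\mathcal{D}_{\text{tr}}$ denote the training set (both symbols are introduced only for this illustration). The per-image selection rate of tool $t_i$ would then be
\[
    \mathrm{freq}(t_i) = \frac{1}{|\mathcal{D}_{\text{tr}}|} \sum_{\bm{x} \in \mathcal{D}_{\text{tr}}} \mathbf{1}\left[t_i \in S(\bm{x})\right],
\]
which lies in $[0, 1]$ for each tool but need not sum to one across tools; dividing instead by the total number of selections, $\sum_{j}\sum_{\bm{x} \in \mathcal{D}_{\text{tr}}} \mathbf{1}[t_j \in S(\bm{x})]$, would yield a distribution over tools.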

We observe that for Camelyon17, the nucleus contour tool has the highest importance, even though the VLM selects it less often than the nucleus bbox and centroid tools.
Similarly, for ISIC, we observe that the lesion segmentation tool, pigment network tool, and brown marker tool have the highest importance. 
% In contrast, ablating negative network, milia-like-cysts, streaks, or the malignant color markers produces only small changes in accuracy. 
This pattern is consistent with the literature: for instance, nuclei count and irregularity are known to be correlated with pathology~\cite{kuenen1984prognostic}. 
Similarly, pigment networks and border irregularity (extracted from lesion segmentations) are important in diagnosing malignant melanoma~\cite{anantha2004detection,stolz1994abcd}. 


% \begin{figure}[htbp]
%  % Caption and label go in the first argument and the figure contents
%  % go in the second argument
% \floatconts
%   {fig:camelyon_nuc_dropout}
%   {\caption{Nuclei-dropout intervention experiment on Camelyon17. We progressively remove nucleus instances with probability $1-p_{\text{mask}}$ by randomly masking all nucleus-dependent tool maps (centroid, bounding box fill, contour, type, type probability.
%     Left: Classification accuracy decreases as more nuclei are removed
%     Middle: The predicted probability of a \textit{Normal} label increases monotonically with dropout.  
%     Right: Directional flip-rates show that significant numbers of predictions flip from Tumor$\rightarrow$Normal as $p_{\text{mask}}$ increases.}}
%   {\includegraphics[width=\linewidth]{figs/centroid_dropout.pdf}}
% \end{figure}
{
\setlength{\belowcaptionskip}{-1pt}
\begin{figure}[t]
\floatconts
  {fig:loto_freq_plots}
  {\vspace{-1.5\baselineskip}
  \caption{Tool-wise importance (Eq.~\ref{eq:tool_importance}) and normalized frequency of VLM tool selections for TBM across Camelyon17 \textbf{(left)} and ISIC-BM/-MN \textbf{(right)}. In each plot, the left axis shows the relative importance of each tool measured by the change in Accuracy (Camelyon17) and AUC (ISIC) when tools are individually removed during inference. The right axis shows the normalized frequency of tools selected by MedGemma during training.}}
  {\includegraphics[width=0.9\linewidth]
{figs/fig_3/combined_camelyon_isic_loto_toolfreq}}
\end{figure}


% \begin{figure}[t]
% \floatconts
%   {fig:loto_freq_plots}% label for whole figure
%   {\caption{Tool-wise importance (Eq.~\ref{eq:tool_importance}) and normalized frequency of tool selections for TBM. Bars show the change in Accuracy (Camelyon17) and AUC (ISIC) when each tool is removed at inference while all other tools are kept. 
%   \textbf{(a)} Camelyon17:  $\Delta\text{Acc}$
%   \textbf{(b)} ISIC-BM and ISIC-MN: $\Delta\text{AUC}$. Distribution of MedGemma selected tools across Train and Test splits.}}
%   {%
%     \subfigure[Camelyon17]{%
%       \label{fig:camelyon_loto}%
%     \includegraphics[width=0.43\linewidth]{figs/fig_4/camelyon_loto_toolfreq.png}%
%     }
%     \subfigure[ISIC]{%
%       \label{fig:isic_loto}%
%       \includegraphics[width=0.55\linewidth]{figs/fig_4/isic_loto_toolfreq.png}%
%     }%
%   }
% \end{figure}

% \begin{figure}[t]
% \floatconts
%   {fig:isic_loto}% label for whole figure
%   {\caption{Tool-wise importance analysis of TBM on Camelyon17 and ISIC using a leave-one-tool-out ablation. Bars show the change in Accuracy (Camelyon17) and AUC (ISIC) when each tool is removed at inference while all other tools are kept. 
%   \textbf{(a)} Camelyon17:  $\Delta\text{Acc}$
%   \textbf{(b)} ISIC-BM and ISIC-MN: $\Delta\text{AUC}$. Distribution of MedGemma selected tools across Train and Test splits.}}
%   {%
%     \subfigure[Camelyon17]{%
%       \label{fig:camelyon_loto}%
%     \includegraphics[width=0.23\linewidth]{figs/camelyon_loto.pdf}%
%     }
%     \subfigure[Camelyon17]{%
%       \label{fig:camelyon_loto}%
%       \includegraphics[width=0.23\linewidth]{figs/camelyon_tool_distr.pdf}%
%     }
%     \subfigure[ISIC]{%
%       \label{fig:isic_loto}%
%       \includegraphics[width=0.23\linewidth]{figs/isic_loto.pdf}%
%     }%
%     \subfigure[ISIC]{%
%       \label{fig:isic_loto}%
%       \includegraphics[width=0.23\linewidth]{figs/isic_tool_distr.pdf}%
%     }%
%   }
% \end{figure}

% \begin{figure}[t]
% \floatconts
%   {fig:isic_loto}% label for whole figure
%   {\caption{
%   Distribution of MedGemma selected tools across Train and Test splits.
% }}% caption for whole figure
%   {%
%     \subfigure[Camelyon17]{%
%       \label{fig:camelyon_loto}%
%       \includegraphics[width=0.48\linewidth]{figs/camelyon_tool_distr.pdf}%
%     }
%     \subfigure[ISIC]{%
%       \label{fig:isic_loto}%
%       \includegraphics[width=0.48\linewidth]{figs/isic_tool_distr.pdf}%
%     }%
%   }
% \end{figure}

% \1 Contour shape normalization
% \2 For each nucleus, with probability $p_{\text{norm}}$ we replace the true contour with a circle centered at the nucleus centroid and radius equal to half of the shorter side of its bounding box, leaving position and approximate size intact while removing shape irregularity. We sweep across $p_{\text{norm}} \in \{0.2, 0.4, 0.6, 0.8, 1.0\}$ and report:
% \3 $\mathrm{P}(\hat{y}=\text{Normal})$ vs.~$p_{\text{norm}}$
% \3 overall flip rate vs.~$p_{\text{norm}}$
% \2 This isolates the contribution of nuclei shape, independent of location/scale.
% \2 Our results showed that Increasing $p_{\text{norm}}$ increases \textit{Normal} predictions and flip-rate, with smaller accuracy drops than dropout, suggesting that shape irregularity contributes additional malignant evidence beyond location and size.
% \begin{outline}
% \1 Type Swap
% \2 For each nucleus, with probability $p_{\text{swap}}$ we replace its discrete type label with a uniformly sampled alternative from the remaining type set (keeping centroid, box, and contour unchanged). We sweep $p_{\text{swap}} \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$ and report:
% \3 $\mathrm{P}(\hat{y}=\text{Normal})$ vs.~$p_{\text{swap}}$
% \3 flip rate vs.~$p_{\text{swap}}$
% \2 This experiment probes sensitivity of predictions to nuclei types.


% \1 Across interventions, TBM predictions change in predictable, biologically consistent ways. We show the the modularity/controllability of TBM predictions via semantically meaningful tool-level edits. 
% \2 removing nuclei increases \textit{Normal} calls and induces \textit{Tumor} to \textit{Normal} flips
% \2 normalizing contours (reducing irregularity) shifts predictions toward \textit{Normal}
% \2 swapping types perturbs decisions in proportion to the swap rate. Unlike post-hoc explanations, these direct edits/manipuations to tool maps verify that TBM’s decisions are mediated by explicit, intervenable factors.


% \1 These experiments provide causal evidence that
% \2 malignant decisions depend on nucleus-level evidence density (specifically in the Camelyon17 dataset)
% \2 contour irregularity is a distinct predictive factor for \textit{Tumor} vs. \textit{Normal}
% \end{outline}

In Appendix~\ref{app:intervention}, we experiment with directly intervening on tool outputs as another method to probe \frameworkabbr{}'s decision-making.
In Appendix~\ref{app:tool_combinations}, we visualize all training-time tool combinations, not just overall selection frequency.