\begin{abstract}
% Human decision-making from medical images typically involves analyzing multiple, often spatially-localized features in the image most clinically relevant to the task at hand, and fusing these features to make a prediction (such as diagnosis).
% In contrast, deep learning approaches take the image directly as input, extracting features and performing prediction in an end-to-end fashion.
% Recently, large vision–language models (VLMs) for medical domains such as MedGemma incorporate medical knowledge, but are similarly hard to interpret. 
% Although well-performing, these models are known to be data-hungry, suffer from limited interpretability, and lack clinical grounding compared to human decision-making.
Recent tool-use frameworks powered by vision-language models (VLMs) improve image understanding by grounding model predictions with specialized tools.
Broadly, these frameworks leverage VLMs and a pre-specified toolbox to decompose the prediction task into multiple calls to tools (often deep learning models), whose outputs are composed to make a prediction.
The dominant approach composes tools through text, via function calls embedded in VLM-generated code or natural language.
However, these methods often perform poorly on medical image understanding, where salient information is encoded as spatially-localized features that are difficult to compose or fuse via text alone.
% \joy{Reading this, I was expecting the next sentence to highlight what makes our method different/novel compared to prior tool-use frameworks. We should highlight the novelty here.}
% Each subtask is solved by an external tool (often a neural network) from a pre-specified toolbox.
% where each tool extracts a task-relevant feature from the image, and then fusing these features to make a prediction.  
% Besides leading to better performance in certain settings, tool-use enables more interpretable, modular, and grounded models.
To address this, we propose a tool-use framework for medical image understanding called the \frameworkname{} (\frameworkabbr{}), which composes VLM-selected tools using a learned Tool Bottleneck Model (TBM).
For a given image and task, \frameworkabbr{} leverages an off-the-shelf medical VLM to select tools from a toolbox, each of which extracts clinically relevant features.
% An off-the-shelf medical VLM selects tools from the toolbox for a given task, and 
Instead of text-based composition, these tools are composed by the TBM, which computes and fuses the tool outputs using a neural network before outputting the final prediction. 
% Effectively, \frameworkabbr{} reformulates medical image understanding as VLM-guided, tool (clinically-relevant feature) use.
We propose a simple and effective strategy that lets TBMs make predictions under an arbitrary VLM tool selection.
Overall, our framework not only improves tool use in medical imaging contexts, but also yields more interpretable, clinically grounded predictors.
% \ehsan{The contribution is not really clear! This does not excite the reader to continue reading the paper enthusiastically.}
% A VLM-guided tool-dropout mechanism conditions tool availability on VLM-derived priors, improving robustness to missing or uncertain tools while preserving the model’s ability to be directly intervened upon through its tool representations. 
% In addition to the benefits that derive from tool use, our novel TBM also builds predictors with task-specific inductive biases. 
We evaluate \frameworkabbr{} on tasks in histopathology and dermatology and find that these advantages enable our framework to perform on par with or better than deep learning-based classifiers, VLMs, and state-of-the-art tool-use frameworks, with particular gains in data-limited regimes. % while providing better interpretability.
Project details and code are available at \url{https://christinaliu2020.github.io/tbm/}. %\url{https://github.com/christinaliu2020/tool-bottleneck-framework}.
\end{abstract}