\section{Conclusion and Future Work}
\label{sec:conclusion}
We present the \frameworkname{} (\frameworkabbr{}), a tool-use framework for clinically informed and interpretable medical image understanding.
\frameworkabbr{} leverages a VLM to select, from a pre-specified toolbox, the tools most relevant to the task at hand.
Unlike prior work that composes tools via text, our framework composes the selected tools using a learned Tool Bottleneck Model (TBM), which computes the tool outputs on the given image and fuses them to make a prediction.
We present a simple yet effective strategy, tool knockout augmentation, for training TBMs to accept arbitrary subsets of tools.
On tasks derived from histopathology and dermatology, \frameworkabbr{} outperforms state-of-the-art tool-use frameworks while being more interpretable and data-efficient than CNNs and VLMs.
Additionally, we propose a way to interrogate tool importance for further interpretability.
% and neuro-symbolic approaches advantages of TBMs in the context of data efficiency and interpretability.
As a future direction, scaling our work to a more comprehensive set of tools could yield a more general framework for medical image understanding.
Optimizing the tool-selection VLM is another interesting direction for exploration.

% The performance and interpretability of TBMs are inherently limited by the quality of the tools they rely on.  Errors or biases in pretrained segmentation, detection, or feature extraction tools can propagate directly to the final predictions. In domains where high-quality, well-calibrated tools are unavailable, TBMs may underperform relative to end-to-end black box models. Similarly, as medical tools improve, our tool-use framework will natural scale in performance.



%Combining sequential and within-image decomposition?    
%Learning/discovering optimal tools?
%End-to-end training of VLM tool selector and TBM.
