\section{Related Work}
\label{sec:rel-works}

CNNs~\cite{tan2019efficientnet,he2016deep,huang2017densely,szegedy2015going,howard2017mobilenets} and ViTs~\cite{dosovitskiy2021an,touvron2021training,liu2021swin} remain the dominant backbones for medical image diagnosis. %benefiting from transfer learning from large-scale natural image datasets \cite{chen2025review, mienye2025deep, liu2021advances}. 
These black-box, highly parameterized architectures achieve strong performance but are known to be data-hungry~\cite{sun2017revisiting,bello2022rethinking} and to lack interpretability~\cite{selvaraju2017grad}.
% Post-hoc interpretability methods include saliency maps or Grad-CAM-style heatmaps \cite{selvaraju2017grad}. %, which highlight important regions but do not guarantee that the underlying features correspond to clinically meaningful concepts, limiting their suitability for audit, intervention, and clinical trust.
%
More recently, medical vision-language models (VLMs) have further increased the scale of training data, leveraging large-scale pretraining on image-text pairs to enable flexible zero- and few-shot generalization across multiple tasks and modalities, including report generation, diagnosis, and visual question answering~\cite{ryu2025vision}.
% Examples such as MedGemma \cite{sellergren2025medgemma} extend general-purpose multimodal backbones to radiology, pathology, dermatology, and related domains, while biomedical VLMs like MedCLIP \cite{wang2022medclip}, BioMedCLIP \cite{zhang2024biomedclip}, and EyeCLIP \cite{shi2025multimodal} adapt contrastive and cross-modal pretraining to medical data. 
Examples include MedCLIP~\cite{wang2022medclip}, BioMedCLIP~\cite{zhang2024biomedclip}, EyeCLIP~\cite{shi2025multimodal}, and MedGemma~\cite{sellergren2025medgemma}.
%Despite their versatility, these models remain largely opaque: their internal reasoning cannot be directly linked to verifiable visual evidence, and task-specific adaptation still demands substantial labeled data and compute.
% TBMs instead offer a complementary approach by treating such foundation models as external priors for domain- and task-specific tool selection, rather than monolithic predictors. In this view, pretrained VLMs and related models can help identify or design specialized visual tools (e.g., segmenters or detectors) whose structured outputs feed a shared classifier, preserving trainability and domain specificity while exposing intermediate, clinically grounded features. TBMs therefore help bridge the gap between the broad generalization capacity of foundation vision-language models and the fine-grained, interpretable reasoning required in safety-critical clinical applications.

Our work is inspired by Concept Bottleneck Models (CBMs)~\cite{koh2020concept}, which first predict a concept layer consisting of multiple image-level concepts for a given image and then fuse these concepts to arrive at a final prediction.
Thus, information flow is effectively ``bottlenecked'' through the concept layer.
CBMs have been widely explored in medical imaging~\cite{marcinkevics2024interpretable,Pan_Integrating_MICCAI2024,kim2023concept}, but they crucially restrict the bottleneck layer to image-level attributes, precluding the use of spatially localized features.
% Importantly, these concepts are scalar-valued. % and the encoders for these concepts require training with ground-truth annotations.
% CBMs are trained under three regimes—\emph{independent}, \emph{sequential}, and \emph{joint}—which differ in how the concept-prediction stage and label-prediction stage are optimized. 
% Independent CBMs train the concept predictor and label predictor separately, sequential CBMs first fit the concept predictor and then train the label predictor on frozen concept outputs, and joint CBMs optimize both stages end-to-end with a combined loss. 
At the cost of additional concept annotation, the two-stage design of CBMs improves interpretability and enables test-time human intervention, since predicted concepts can be edited before the final prediction is made.
Extensions to CBMs include Label-Free CBMs~\cite{oikarinen2023label}, Concept Embedding Models~\cite{espinosa2022concept}, Graph CBMs~\cite{xu2025graph}, probabilistic CBMs~\cite{kim2023probabilistic}, and works that integrate language models or VLMs to generate or align concepts~\cite{yang2023language, oikarinen2023label, yan2023robust, hsu2025makes, srivastava2024vlg, liu20242024, gao2024aligning, patricio2025two}.
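As a minimal sketch of the basic two-stage CBM design described above, in notation introduced here purely for illustration (the symbols $g$, $f$, $\hat{c}$, and $k$ are assumptions of this sketch rather than notation fixed by the cited works), a CBM can be written as
\[
  \hat{c} = g(x), \qquad \hat{y} = f(\hat{c}),
\]
where $x$ is the input image, $g$ is the concept predictor whose output $\hat{c}$ is a vector of $k$ image-level concept scores, and $f$ is the label predictor, which sees only $\hat{c}$ and not $x$.
Under this sketch, a test-time intervention corresponds to overwriting selected entries of $\hat{c}$ with expert-provided values before applying $f$.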
% but it also requires costly manual concept labels that can be noisy or infeasible in domains like medicine, making CBMs fragile under annotation errors. 
% To address these limitations, Label-Free CBMs \cite{oikarinen2023label} transform arbitrary neural networks into interpretable CBMs without labeled concept data, improving scalability while retaining accuracy and interpretability, and Concept Embedding Models \cite{espinosa2022concept} learn high-dimensional, interpretable concept representations that relax the need for complete concept annotations. Graph CBMs \cite{xu2025graph} further introduce self-supervised learning over latent concept graphs to capture explicit relationships among concepts, while probabilistic CBMs model uncertainty over concept predictions to make interpretations more reliable under noisy or ambiguous inputs \cite{kim2023probabilistic}. Recent work also uses large language models to generate or ground concepts and thereby reduce manual concept annotation \cite{yang2023language, oikarinen2023label, yan2023robust, hsu2025makes}, and extends this idea with vision-language models that guide concept prediction to better align concepts with visual features \cite{srivastava2024vlg, liu20242024, gao2024aligning, patricio2025two}.


% % However, all three training strategies require ground-truth concept annotations during training, which is a major limitation in clinical settings where detailed concept labels (e.g., fine-grained morphological attributes) are scarce, costly to obtain, or simply ill-defined. 
% % Additionally, CBMs rely on these discrete, human-curated concepts to be well-defined and semantically aligned with task labels in order to achieve strong performance and capture the richness of visual information present in images.
% TBMs generalize CBMs by replacing manually annotated concepts with outputs from pretrained, expert-level tools, such as segmentation or detection models, which provide structured pixel- or image-level features. 
% % Unlike CBMs, TBMs can be trained end-to-end on downstream tasks even when explicit concept labels are unavailable, combining the interpretability of structured bottlenecks with the flexibility of modern vision architectures.
% Prior work on segmentation-based medical image classification similarly combines pixel-level, clinically relevant features with downstream classifiers by using pretrained segmentation networks to provide structural guidance, improving interpretability and robustness to spurious correlations \cite{wong2018building, hooper2023case, rayed2024deep}. More broadly, reviews of medical image segmentation highlight the proliferation of pretrained tools (e.g., for segmentation and detection) and their widespread integration into medical imaging pipelines \cite{gao2025medical}. Y-net \cite{mehta2018net} introduces a joint segmentation–classification architecture that uses intermediate pixel-level supervision to improve data efficiency and generalization in medical diagnosis tasks. These models train segmentation and classification branches together to highlight diagnostically relevant regions, but remain monolithic and task-specific, lacking modularity and explicit interpretability, and cannot reason over heterogeneous tool outputs (e.g., multiple independent segmentation or detection models). Since their internal representations are not explicitly structured, such architectures do not naturally support model intervention, causal inspection, or attribution of predictions to specific intermediate components. TBMs extend this segmentation-for-classification paradigm by allowing multiple independently pretrained pixel-level tools to feed structured representations into a shared classifier, while retaining modularity and interpretability at the level of individual tools.

More recently, neuro-symbolic approaches have demonstrated potential in grounding AI models in symbols, where a symbol can be represented by a tool~\cite{hsu2023s}. 
One line of work explores programmatic execution of tools for complex visual understanding~\cite{suris2023vipergpt,gupta2023visual}.
These approaches leverage a toolbox of pretrained image-level or pixel-level tools (e.g., segmentation, detection, or object-level tools) and generate code that calls these tools; the code is subsequently executed to make a prediction.
In addition to outperforming non-symbolic visual understanding approaches, these works can be seen as grounding the predictor in tools that are themselves interpretable~\cite{hu2024visual,lu2025rsvp}.
% ViperGPT \cite{suris2023vipergpt} and related approaches \cite{gupta2023visual, hsu2023s} treat visual reasoning as code generation, using a language model to compose calls to a library of vision tools and then executing the resulting program to answer queries. 
% These modular, programmatic designs enable interpretable intermediate computations, support compositional reasoning, and often exhibit strong zero-shot performance across diverse visual tasks. 
% Building on this paradigm, Visual Program Distillation \cite{hu2024visual} seeks to distill the modular reasoning and tool-use behavior of such systems into a single vision-language model, aiming to retain interpretability and compositionality while simplifying deployment. 
% In parallel, multimodal reasoning frameworks that integrate grounded segmentation with language-based symbolic-style reasoning show that coupling structured visual outputs with linguistic reasoning can further enhance zero-shot visual understanding and reasoning capabilities \cite{lu2025rsvp}.

Our work builds on both CBMs and neuro-symbolic approaches.
Whereas CBMs use a bottleneck layer composed of scalar concepts, our TBM formulation generalizes them to handle structured features, such as pixel-level features that encode spatially localized information, which is essential in medical imaging.
Whereas neuro-symbolic approaches generate text to compose tool outputs, our TBM can be viewed as a learned tool composition mechanism, which is better equipped to fuse the fine-grained, spatially localized features common in medical imaging.


% Standard convolutional neural networks (CNNs) trained directly on image-level labels remain strong baselines in medical imaging tasks. However, they are inherently ``black-box'' systems that are non-interpretable: they cannot incorporate external domain knowledge, and their learned features cannot be meaningfully examined or verified by humans. Post-hoc interpretability techniques such as Grad-CAM or attention visualization provide limited insight and are not causally linked to the model’s reasoning. Black-box models are also prone to rely on spurious correlations and often degrade sharply under distribution shift, since their internal features are not constrained to correspond to clinically meaningful cues. In contrast, TBMs explicitly ground predictions in human-verifiable tool outputs, which leads to more robust and reliable generalization under domain shift.


% \begin{outline}
% \1 Concept Bottleneck Models
% \2 CBMs explicitly structure the prediction pipeline into two stages: first predicting human-defined concepts from the input, and then predicting the final label from those concepts.
% \2 This design improves interpretability and allows for interventions on concept values, but requires manual concept annotation and concept-level supervision, which is costly and often unavailable in real-world domains such as medicine. 
% \2 Moreover, CBMs rely on discrete, human-curated concepts. Performance depends on concept-label alignment and concepts being well-defined in order to capture the richness of visual information present in images.
% \2 TBMs generalize CBMs by replacing manually annotated concepts with outputs from pretrained, expert-level tools, such as segmentation or detection models, which provide structured pixel-level or image-level features.
% \2 Unlike CBMs, TBMs can be trained end-to-end on downstream tasks even when explicit concept labels are unavailable.

% \1 Segmentation-for-Classification
% \2 Segmentation-based classification architectures, such as Y-Net, leverage intermediate pixel-level supervision to improve generalization and data efficiency.
% \2 However, these methods are typically monolithic and task-specific: the segmentation and classification components are trained jointly, but not modularly. 
% \2 Y-Net and related methods do not support reasoning across heterogeneous tool outputs (e.g. multiple segmenters, detectors, or regressors) and lack explicit interpretability. Furthermore, these methods do not allow for model intervention because they lack the ability to attribute predictions to specific intermediate components, 
% \2 TBMs extend this paradigm by allowing multiple types of pixel-level tools (each potentially pretrained independently) to contribute structured information to a shared classifier, while retaining modularity and interpretability.

% \1 Vision-Language and Large Multimodal Models (VLMs)
% \2 Foundation models such as MedGemma and GPT-4o exhibit remarkable zero-shot generalization and can use external knowledge to perform medical reasoning from images and text. In particular, MedGemma is pretrained on diverse medical imagery (e.g. chest X-rays, digital pathology, dermatology, fundus) and textual clinical data, and is adaptable to tasks such as medical image classification and image-based report generation or medical visual question answering.
% \2 However, these models are not explicitly interpretable or human-verifiable - their reasoning steps and internal representations cannot be directly linked to visual evidence. Moreover, they cannot be fine-tuned for specific diagnostic tasks without extensive computational resources or labeled data.
% \2 In contrast, TBMs can incorporate these foundation models as external "priors" for domain and task-specific tool selection, while retaining trainability, domain specificity, and human-verifiable interpretability.
% \2 ViperGPT
% \2 Foundation Models (?)

% \1 Interpretability (Black-box Methods)
% \2 Standard CNNs trained end-to-end on image-level labels are strong baselines but are inherently not interpretable. They cannot incorporate external domain knowledge or structured tool outputs, and their learned features are not directly interpretable or verifiable by humans.
% \2 TBMs address these limitations by explicitly grounding predictions in domain-specific tool outputs, leading to improved interpretability and out-of-distribution generalization.

% \end{outline}