\section{Scope of comparison vis-\`a-vis supervised baselines}
\label{sec:appendix_supervised_models}
The main study is a controlled, zero-shot comparison of SAM 2 and SAM 3 models on 3D medical data under identical visual prompts. This design isolates differences in prompt interpretation and propagation dynamics and avoids conflating model behavior with medical training data, dataset curation, or task-specific optimization.

Under this scope, models trained on labeled medical data, whether framed as 3D medical foundation models (e.g., MedSAM2~\cite{ma2025medsam2}) or task-specific supervised models (e.g., nnU-Net~\cite{isensee2021nnu}), fall outside the study setting: they are not zero-shot, and their performance reflects domain training and dataset choices in addition to architecture. We also exclude 2D-only promptable models such as MedSAM~\cite{ma2024segment}, since they do not natively support 3D data. Incorporating them into our evaluation would require an external propagation mechanism, an additional algorithmic component that would confound attribution. Moreover, task-specific models like nnU-Net~\cite{isensee2021nnu} are typically trained per dataset, whereas our evaluation targets a single promptable model that can be applied uniformly across datasets and targets without retraining.

Beyond the general protocol mismatch, MedSAM2 raises two practical comparability issues. First, it is commonly restricted to a bounding-box-centric prompting interface, which does not align with our multi-prompt evaluation or the corresponding analyses of prompt strength. Second, rigorous multi-dataset benchmarking benefits from clear documentation of training dataset composition and splits, so that overlap assumptions can be verified; at the time of our study, these details were not fully accessible for MedSAM2 at the level needed for strict overlap verification. This limited transparency complicates data provenance checks, i.e., tracing exactly which source datasets and subsets contributed to training, and makes it difficult to rule out inadvertent overlap or leakage when evaluating across multiple public benchmarks.
