\section{Detailed Prompt-Frame Results}
\label{sec:appendix_promptframe}
This appendix provides the structure-wise breakdown of prompt-frame accuracy to supplement the analysis in Section~\ref{sec:results_promptframe}. Table~\ref{tab:prompt_frame_dsc} summarizes the DSC across all structures, modalities, and prompt types.

\input{tables/table_promptframe}

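
For reference when reading the table, the DSC values are assumed here to follow the standard definition of the Dice similarity coefficient. The symbols $P$ (predicted mask) and $G$ (ground-truth mask) are notation introduced only for this illustration and are not taken from the main text:
\[
  \mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|},
\]
which ranges from 0 (no overlap) to 1 (perfect overlap). Differences quoted below in ``DSC points'' are read on the 0--100 scale, so a 5-point gap corresponds to a difference of 0.05 in this ratio.
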
Under click prompts in CT, the gains of SAM~3 over SAM~2 are substantial for anatomically small or low-contrast targets such as bladder, pancreas, esophagus, prostate, and spleen. Improvements are similarly pronounced in MRI, especially for cardiac structures, where SAM~3 significantly outperforms SAM~2 for LV, RV, and myocardium under both single- and multi-click prompting. Multi-click prompting (1,2) reduces ambiguity for both models, yet SAM~3 retains a clear, statistically significant advantage for nearly all CT and MRI structures, frequently with $p < 0.001$.

For bounding-box prompts, where the spatial support is considerably less ambiguous than for clicks, the performance gap narrows but does not disappear. SAM~3 continues to produce higher DSC for most CT and MRI structures, although with smaller margins, and in a few cases SAM~2 performs slightly better than SAM~3 (e.g., MR Bladder and MR Hippocampus Posterior). Box prompts achieve the highest absolute accuracy for both models, and the differences between the two models typically stay within about 5 DSC points. The exception is MR Myocardium, where SAM~3 outperforms SAM~2 by about 20 DSC points.

Ultrasound exhibits a mixed pattern. For segmentation of cardiac chambers in cine sequences (LA, LV endocardium, LV epicardium), SAM~3 achieves substantially higher DSC for click prompts, reflecting improved localization. In contrast, for the SegThy dataset (thyroid, carotid arteries, and jugular veins), both models exhibit near-total failure under click prompting, with DSCs frequently remaining in the single digits. Segmentation accuracy becomes meaningful only when bounding-box prompts are supplied; in this viable regime, SAM~2 consistently outperforms SAM~3 across the thyroid and all vascular targets.

For endoscopy (CholecSeg8K), SAM~3 shows an advantage under single-click (1,0) prompting, outperforming SAM~2 for the majority of the tissue and instrument classes. Under multi-click (1,2) and bounding-box prompts, however, the results are more balanced: SAM~2 and SAM~3 each achieve higher DSC for different categories, and neither model dominates across all structures. Notably, even where the numerical differences between the two models are large, none of these comparisons reaches statistical significance, because CholecSeg8K contains only a small number of annotated videos, which limits the power of paired significance testing.
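
The effect of a small sample on paired testing can be made concrete with a short calculation. As an illustration only, assume an exact two-sided Wilcoxon signed-rank test (or sign test) on $n$ paired videos with no ties or zero differences; neither the specific test nor the value of $n$ is taken from this appendix. Under the null hypothesis all $2^{n}$ sign patterns of the paired differences are equally likely, so the most extreme outcome, in which one model wins on every video, yields the smallest attainable $p$-value:
\[
  p_{\min} = \frac{2}{2^{n}} = 2^{\,1-n}.
\]
For a hypothetical $n = 5$, this gives $p_{\min} = 0.0625 > 0.05$, so under these assumptions even a perfectly consistent difference could not reach significance at the 5\% level, regardless of its magnitude.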