\documentclass{midl} % Include author names

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

%%%%%%%%%%%%% Extra packages 
\usepackage{multirow}
\usepackage{booktabs}   % for \toprule, \midrule, \bottomrule
\usepackage{array}      % for better column formatting
% \usepackage[ruled,vlined,linesnumbered]{algorithm2e}

\usepackage[table]{xcolor}     % for \cellcolor in tables
\usepackage{soul}              % for \hl highlighting
\definecolor{sigStrong}{RGB}{220,255,220} % strong significance p < 0.001
\definecolor{sigWeak}{RGB}{255,255,204}   % weak significance 0.001 < p < 0.05

\definecolor{samTwoClickOneZero}{HTML}{AEC6CF}
\definecolor{samTwoClickOneTwo}{HTML}{77DD77}
\definecolor{samTwoBBox}{HTML}{FFB347}
\definecolor{samTwoMask}{HTML}{FF6961}

\definecolor{samThreeClickOneZero}{HTML}{B39EB5}
\definecolor{samThreeClickOneTwo}{HTML}{FFDAC1}
\definecolor{samThreeBBox}{HTML}{E2F0CB}
\definecolor{samThreeMask}{HTML}{C9C9FF}

\definecolor{gtMask}{HTML}{2CA02C}  % Ground-truth (GT) green
\definecolor{samTwoBase}{HTML}{1F77B4}  % Matplotlib Blue
\definecolor{samThreeBase}{HTML}{FF7F0E} % Matplotlib Orange

% Make LaTeX much less eager to create float-only pages
\renewcommand{\floatpagefraction}{0.9}      % for single-column floats
\renewcommand{\dblfloatpagefraction}{0.9}   % for double-column floats (if needed)

% Allow floats to take a lot of space on *text* pages
\renewcommand{\topfraction}{0.95}
\renewcommand{\bottomfraction}{0.95}
\renewcommand{\textfraction}{0.02}
%%%%%%%%%%%%%

\jmlrvolume{-- 26}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026}
\editors{Accepted for publication at MIDL 2026}

\title[SAM 2 vs. SAM 3 for Zero-Shot Segmentation of 3D Medical Data]{Comparing SAM 2 and SAM 3 for Zero-Shot Segmentation of 3D Medical Data}

% More complicated cases, e.g. with dual affiliations and joint authorship
\midlauthor{\Name{Satrajit Chakrabarty\nametag{$^{1}$}} \orcid{0000-0002-9664-4470} \Email{satrajit.chakrabarty@gehealthcare.com}\\
\Name{Ravi Soni\nametag{$^{1}$}} \Email{ravi.soni@gehealthcare.com}\\
\addr $^{1}$ GE HealthCare, San Ramon, CA, USA \\
}

\begin{document}

\maketitle

\begin{abstract}
Foundation models, such as the Segment Anything Model (SAM), have heightened interest in promptable zero-shot segmentation. Although these models perform strongly on natural images, their behavior on medical data remains insufficiently characterized. While SAM~2 has been widely adopted for annotation in 3D medical workflows, the recently released SAM~3 introduces a new architecture that may change how visual prompts are interpreted and propagated. To assess whether SAM~3 can serve as an out-of-the-box replacement for SAM~2 for zero-shot segmentation of 3D medical data, we present the first controlled comparison of the two models, evaluating SAM~3 in its Promptable Visual Segmentation (PVS) mode under a variety of prompting strategies. We benchmark on 16 public datasets (CT, MRI, ultrasound, endoscopy) covering 54 anatomical structures, pathologies, and surgical instruments. We further quantify three failure modes: prompt-frame over-segmentation, over-propagation after object disappearance, and temporal retention of well-initialized predictions. Our results show that SAM~3 is consistently stronger under click prompting across modalities, with fewer prompt-frame over-segmentation failures and slower decay of well-initialized predictions than SAM~2. Under bounding-box and mask prompts, performance gaps narrow for a few CT/MRI structures and the models trade off termination behavior, while SAM~3 remains stronger on ultrasound and endoscopy sequences. Overall, the results position SAM~3 as the superior default choice for most medical segmentation tasks, while clarifying when SAM~2 remains a preferable propagator.
\end{abstract}

\begin{keywords}
Foundation models, Segment Anything Model, Zero-shot segmentation, SAM 2, SAM 3.
\end{keywords}

\section{Introduction}
Foundation models for promptable segmentation have reshaped interactive medical image analysis. The Segment Anything Model (SAM) \cite{ref:sam1} introduced a general-purpose framework for zero-shot segmentation of 2D images using point, box, and mask prompts. SAM~2 \cite{ravi2024sam} extended this approach to videos and 3D-like sequences with a memory-based transformer for frame-to-frame propagation, enabling consistent segmentation across volumes and cine series. The most recent iteration, SAM~3 \cite{carion2025sam3segmentconcepts} introduces a unified Perception Encoder and a DETR (DEtection TRansformer)-style detector--tracker~\cite{carion2020end}, and adds concept-level prompting modules for open-vocabulary segmentation. 

Several medical variants of the SAM-family models have been proposed through domain-specific training, including supervised adaptation on curated medical corpora (e.g., MedSAM~\cite{ma2024segment}, MedSAM2~\cite{ma2025medsam2}) and synthetic-data-driven training that reduces reliance on real annotations (e.g., SynthFM~\cite{sengupta2025synthfm}, SynthFM-3D~\cite{chakrabarty2026synthfm3d}). However, the original SAM-family models remain practically relevant in medical workflows when the goal is interactive, human-in-the-loop annotation or rapid dataset bootstrapping rather than fully automated clinical segmentation. In such cold-start settings involving new anatomies, new devices, or new sites, labeled training sets may be unavailable, and a promptable zero-shot model can produce a first-pass mask that a human corrects to efficiently generate training data for downstream supervised models.

In this context, an important open question is whether SAM~3 can serve as an out-of-the-box replacement for SAM~2 under purely visual prompting. SAM~3 supports two regimes: Promptable Concept Segmentation (PCS), which enables concept/text-conditioned outputs, and Promptable Visual Segmentation (PVS), which operates from purely visual prompts \cite{carion2025sam3segmentconcepts}. Although PCS expands the model's scope, the architectural changes introduced in SAM~3 may also alter how visual prompts are interpreted and how masks are propagated over long medical sequences. To answer this question, we conduct a large-scale, controlled comparison of SAM~2 and SAM~3 across sixteen public datasets spanning CT, MRI, ultrasound, and endoscopy, covering 54 anatomical structures, pathologies, and surgical instruments. We evaluate SAM~3 in the PVS regime using only visual prompts and no concept prompts (i.e., no text or exemplar inputs), so that both models operate under matched prompting and propagation conditions. We benchmark single-click, multi-click, bounding-box, and mask prompts applied only to the first frame. Beyond prompt-frame and full-volume performance, we quantify three failure modes that are critical for interactive workflows: prompt-frame over-segmentation (poor initialization), temporal retention (forgetting), and over-propagation after object disappearance.

This study makes three main contributions:
\begin{itemize}
    \item A unified, cross-modality evaluation framework for comparing SAM~2 and SAM~3 under identical visual prompts.
    \item A comprehensive empirical characterization of prompt-frame and full-volume/sequence performance across sixteen datasets, spanning four modalities and 54 targets, under single-click, multi-click, bounding-box, and mask prompting.
    \item A cross-model failure-mode analysis that quantifies prompt-frame over-segmentation, temporal decay of prediction, and over-propagation, providing the first systematic evidence on when SAM~3 can serve as an out-of-the-box replacement for SAM~2 and when SAM~2 remains the more conservative propagator.
\end{itemize}
By isolating visual-prompt behaviour and conducting extensive cross-modality experiments, this work clarifies the complementary strengths of SAM~2 and SAM~3 and provides practical guidance for selecting between these models in clinical and research settings.

\section{Methods}
\subsection{Comparison rationale and scope}
The objective of this study is to compare SAM~2 and SAM~3 under controlled and identical prompting conditions for medical image segmentation in 3D volumes and medical video sequences. SAM~3 expands the SAM~2 family with both PVS and PCS~\cite{carion2025sam3segmentconcepts}. Because our goal is a controlled comparison under the same user interaction assumptions used in medical annotation workflows, we evaluate both models using only visual prompts (points, boxes, masks) and identical first-frame initialization followed by forward propagation. This scope isolates differences in visual prompt interpretation and propagation dynamics, and avoids introducing unmatched semantic inputs that would confound attribution. A key motivation for our controlled study is that it provides architectural and prompting insights that can inform future medically adapted variants of SAM~3, including design choices around PVS-style prompting and propagation behavior.

This scope also defines what constitutes a fair baseline: we study only the zero-shot behavior of the released SAM-family models on 3D medical data, without any domain-specific training or task optimization. Accordingly, methods whose performance is intrinsically tied to medical training data or task-specific supervision are not included as baselines in our study, including 2D-only promptable models (e.g., MedSAM~\cite{ma2024segment}), medically fine-tuned SAM~2 derivatives (e.g., MedSAM2~\cite{ma2025medsam2}), and fully supervised task-trained pipelines (e.g., nnU-Net~\cite{isensee2021nnu}). Since a fair comparison to supervised baselines would also require medical fine-tuning of SAM~2/SAM~3 under matched data and protocols, we focus our analysis on stress-testing out-of-the-box prompting and propagation behavior. Further discussion of scope and comparability assumptions is provided in Appendix~\ref{sec:appendix_supervised_models}.


\subsection{Model Overview}
SAM~2~\cite{ravi2024sam} is an encoder--decoder architecture built on the Hiera (Hierarchical Vision Transformer) backbone~\cite{ryali2023hiera}. Its defining feature is a streaming memory mechanism designed for semi-supervised video object segmentation: a memory bank stores features and masks from past frames, and a memory-attention module aggregates these to enforce spatio-temporal consistency during propagation through a 3D volume or cine sequence. %SAM~3~\cite{carion2025sam3segmentconcepts} instead uses a unified Perception Encoder and a DETR-style (DEtection TRansformer) detector--tracker~\cite{carion2020end} with learnable object queries for localization and tracking, and additionally supports concept-level prompting and presence prediction.
SAM~3~\cite{carion2025sam3segmentconcepts} uses a unified Perception Encoder shared by a DETR-style detector and a tracker. The detector follows the DETR paradigm with learnable object queries for localization/association~\cite{carion2020end}, while the tracker inherits the SAM~2 transformer encoder--decoder for video segmentation and interactive refinement and retains a SAM~2-style propagation mechanism with a memory encoder and memory bank.

In this work, we evaluate SAM~3 in the PVS mode, so that SAM~3 operates as a visual promptable tracker and segmenter under the same interaction protocol as SAM~2. Importantly, in all our experiments, this mode is realized by inference-path selection where we use the official tracker-based visual-prompt interface released by the authors so that concept-conditioning modules are not exercised. Appendix~\ref{sec:appendix_sam3_disable_concepts} documents the exact inference interface we build upon.

Under our fixed visual-prompt protocol, cross-model differences are interpreted through (i) the learned visual representation (Hiera backbone versus the unified Perception Encoder) and (ii) the propagation dynamics induced by the two tracking/memory formulations, which together govern initialization quality, retention, and termination over long sequences. More details on their differences, and a component-level summary linking these differences to different failure modes discussed in the paper, are provided in Appendix~\ref{sec:appendix_sam2_sam3_arch}.


\subsection{Prompting Strategy}
We evaluate three standard visual prompting strategies: (i) \emph{click prompting}, using either a single positive click (1,0) or a mixed positive--negative configuration (1,2), where the positive click is placed near the centroid of the target and negative clicks are sampled from a dilated region around the structure; (ii) \emph{bounding-box prompting}, where a tight axis-aligned box around the ground-truth structure in the first frame provides coarse geometric context; and (iii) \emph{mask prompting}, where a binary ground-truth mask from the first frame in which the structure appears is supplied as the initial prompt. All prompts are provided only on the first frame. Thereafter, the models receive no iterative prompting or corrective interactions and propagate their predictions sequentially from the first to the last frame without forward--backward refinement, temporal smoothing, or post-processing. For click prompts, we avoid oracle-style interactive prompting (i.e., placing subsequent clicks using ground-truth error regions conditioned on a model's intermediate predictions) and instead use a fixed, model-independent initialization prompt so that SAM~2 and SAM~3 receive identical click inputs. Full details of the prompting protocol and a click-jitter robustness study are provided in Appendix~\ref{sec:appendix_click_protocol}. 

\subsection{Datasets and Implementation Details}
\input{tables/table_dataset}
We evaluate SAM~2 and SAM~3 on sixteen publicly available medical imaging datasets spanning four imaging modalities: 3D CT, 3D MRI, ultrasound (2D cine and 3D volumes), and endoscopy video (Table~\ref{tab:dataset-description}, Figure~\ref{fig:dataset_figure}). All evaluated datasets are either 3D volumes or temporal sequences; we do not include any 2D datasets.
Our data selection covers a broad spectrum of anatomical structures, pathological conditions, and clinical instruments across modalities, ensuring that the evaluation reflects the diversity encountered in real-world clinical imaging workflows. The CT cohorts include multi-organ abdominal benchmarks (AMOS~\cite{ref:amos}, BTCV~\cite{landman2015miccai}, FLARE22~\cite{ma2024unleashing}, TotalSegmentator~\cite{ref:total}) together with oncologic and thoracic tasks from the MSD collection (lung tumors, pancreas and pancreatic tumors, spleen, and colon cancer)~\cite{antonelli2022medical,simpson2019large}. \input{figures/figure_dataset} MRI coverage comes from AMOS22~\cite{ref:amos}, ACDC~\cite{bernard2018deep}, MSD Task02 Heart and Task04 Hippocampus~\cite{antonelli2022medical,simpson2019large}, and the MRI subset of TotalSegmentator~\cite{totalsegmentatormri}. Ultrasound is represented by cardiac cine sequences from CAMUS~\cite{ref:camus} and 3D thyroid ultrasound from SegThy~\cite{kronke2022tracked}, while CholecSeg8K~\cite{hong2020cholecseg8k,twinanda2016endonet} provides endoscopy video frames with organ and instrument labels. Following prior work that treats 3D medical volumes as video-like slice sequences for SAM-style models, where each slice can be treated as a frame, we represent each 3D scan as an ordered sequence of 2D slices and process it sequentially \cite{dong2024segment,ma2025medsam2}. Segmentation accuracy is measured using the Dice similarity coefficient (DSC). Statistical significance is assessed using paired Wilcoxon signed-rank tests on video/volume-level DSC, with significance defined at $\alpha = 0.05$. All preprocessing and checkpoint details are provided in Appendix~\ref{sec:appendix_preproc}.
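
For reference, the standard definition of the DSC between a predicted binary mask $M_{pred}$ and a ground-truth binary mask $M_{gt}$, with $|\cdot|$ denoting the foreground pixel (or voxel) count, is
\[
\mathrm{DSC}(M_{pred}, M_{gt}) = \frac{2\,|M_{pred} \cap M_{gt}|}{|M_{pred}| + |M_{gt}|}.
\]
The score ranges from 0 (no overlap) to 1 (perfect agreement) and is undefined when both masks are empty. This formula is only the textbook definition; how empty-mask frames are treated and how per-frame values are aggregated to the video/volume level are protocol choices that it does not fix.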

\subsection{Failure Mode Analysis}
\label{sec:failure_methods}
To complement volume-level DSC, we quantify three failure modes at the \textit{case} level, where a \textit{case} is a single target structure within a single volume or sequence of a dataset. All metrics are computed per case and summarized as distributions across all cases, stratified by prompting mode.

\noindent\textbf{\textit{1. Prompt-frame oversegmentation (flooding).}}
To measure whether the model accurately resolves the target's spatial extent or ``floods'' into the background, we compute an \emph{area ratio} on the prompt frame $t_0$. Let $M_{gt}^{(t_0)}$ and $M_{pred}^{(t_0)}$ denote the ground-truth and predicted binary masks at $t_0$, and $|\cdot|$ the foreground pixel count. For all cases with $|M_{gt}^{(t_0)}|>0$ we define
\begin{equation}
R = \frac{|M_{pred}^{(t_0)}|}{|M_{gt}^{(t_0)}|},
\end{equation}
and analyze both the distribution of $R$ and the fraction of \emph{severe flooding} events ($R > 2$).
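
As an illustrative reading of $R$, using hypothetical pixel counts rather than values from our data, a prediction of 1200 foreground pixels against a 400-pixel ground-truth mask gives
\[
R = \frac{1200}{400} = 3 > 2,
\]
which would be counted as severe flooding, whereas a 500-pixel prediction for the same target gives $R = 1.25$ and would not. Because $R$ compares areas only, a value near 1 does not by itself imply good overlap, so $R$ is best read alongside the prompt-frame DSC.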

\noindent\textbf{\textit{2. Temporal retention (forgetting).}}
To measure how segmentation quality evolves across a volume/sequence while the object is present, we model the decay of Dice over the object's lifespan. For each case, we consider all frames where the ground-truth mask is non-empty and DSC is defined, re-index the frame IDs to a normalized time variable $\tau \in [0,1]$, and fit a simple linear model, $DSC(\tau) \approx \alpha + \beta\,\tau$. The \emph{normalized decay slope} $\beta$ serves as a retention score: values closer to zero indicate stable performance, whereas more negative values of $\beta$ correspond to faster forgetting. We compute $\beta$ for all cases and, separately, for the subset of cases with good initialization (prompt-frame $DSC \ge 0.7$).
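
To make the fit concrete, one natural choice is ordinary least squares; this estimator is an assumption for illustration, since the description above specifies only a simple linear model. For a case with $n \ge 2$ object-present frames at normalized times $\tau_1,\dots,\tau_n$ and Dice values $d_1,\dots,d_n$, with sample means $\bar{\tau}$ and $\bar{d}$, the least-squares estimates are
\[
\hat{\beta} = \frac{\sum_{i=1}^{n} (\tau_i - \bar{\tau})(d_i - \bar{d})}{\sum_{i=1}^{n} (\tau_i - \bar{\tau})^2},
\qquad
\hat{\alpha} = \bar{d} - \hat{\beta}\,\bar{\tau}.
\]
Because $\tau$ spans $[0,1]$, $\hat{\beta}$ can be read directly as the fitted change in DSC between the first and last object-present frames; for example, $\hat{\beta} = -0.2$ corresponds to a fitted loss of 0.2 DSC over the object's lifespan.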

\noindent\textbf{\textit{3. Over-propagation after object disappearance.}}
To quantify how long a model continues to hallucinate a mask after the physical object has disappeared, we count the number of \emph{over-propagated frames}. In our evaluation, each case is an ordered sequence/volume with frames indexed by $t$. Let the target ground-truth mask at frame $t$ be $M_{gt}^{(t)}$ and the model prediction mask be $M_{pred}^{(t)}$. Let $t_{\text{last}}$ denote the final frame where the target ground-truth mask is non-empty ($M_{gt}^{(t_{\text{last}})}\neq\emptyset$). The over-propagation length for a case is then
\[
L = \#\{t > t_{\text{last}} \mid M_{gt}^{(t)}=\emptyset \ \wedge\ M_{pred}^{(t)} \text{ is non-empty}\},
\]
i.e., the number of frames after $t_{\text{last}}$ in which the target ground-truth mask is empty but the model still produces a non-empty prediction. Note that if the target ground-truth mask persists through the final frame of the sequence/volume (e.g., in the CAMUS dataset), then no post-$t_{\text{last}}$ frames exist and $L=0$ by definition. We summarize $L$ via boxen plots and empirical cumulative distribution functions, and report percentiles such as the 90th percentile ($P_{90}$), which indicates the over-propagation length below which 90\% of cases fall.
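
As a worked illustration with hypothetical frame indices, consider a 100-frame sequence indexed $t = 0,\dots,99$ in which the target is last visible at $t_{\text{last}} = 59$, so that frames $60$ through $99$ all have empty ground truth. If the model predicts a non-empty mask on frames $60$ through $71$ and an empty mask thereafter, then
\[
L = \#\{60, 61, \dots, 71\} = 12.
\]
Since $L$ is defined as a count over a set of frames rather than as the length of a contiguous run, a model that drops the mask and later re-acquires it would still accumulate one count for every post-$t_{\text{last}}$ frame with a non-empty prediction.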




\section{Results}
\subsection{Prompt-Frame Accuracy}
\label{sec:results_promptframe}
To isolate the effect of prompt interpretation, defined here as the model's ability to accurately resolve the spatial extent of the target structure on the initial frame based on user input, we measured segmentation performance on the prompt frame only. While detailed numerical results for all 54 targets are provided in Appendix~\ref{sec:appendix_promptframe} (Table~\ref{tab:prompt_frame_dsc}), Figure~\ref{fig:flooding_figure} summarizes two key aspects: (a) the distribution of the prediction-to-ground-truth area ratio $R$ for each prompt type on a log scale, and (b) the fraction of cases with severe over-segmentation ($R>2$) stratified by object ground-truth size.
\input{figures/figure_flooding}

Across all structures and prompt types, SAM~3 provides markedly stronger and more stable initialization than SAM~2. Under click prompting, SAM~2 exhibits a severe instability manifested as a heavy-tailed distribution in Figure~\ref{fig:flooding_figure}a: its predicted masks are frequently $10\times$ to $10^5\times$ larger than the ground truth. Specifically, the median area ratio for single-click prompts is $9.9\times$ for SAM~2 versus $2.6\times$ for SAM~3; for multi-click prompts, the medians drop to $3.4\times$ and $2.0\times$, respectively, and for bounding boxes they are close to unity at $1.16\times$ (SAM~2) and $1.11\times$ (SAM~3). Thus, even under sparse clicks, SAM~3 keeps the predicted area much closer to the ground-truth support, whereas SAM~2 frequently floods large portions of the frame. 

Figure~\ref{fig:flooding_figure}b quantifies the frequency of severe over-segmentation ($R>2$). For single-click prompting on small targets ($<500$ pixels), 79\% of SAM~2 initializations are severely over-segmented, compared with 69\% for SAM~3; for medium-sized structures (500--2k pixels), the gap widens to 53\% vs.\ 19\%, and even for the largest objects ($\geq 10$k pixels), severe over-segmentation occurs in 90\% of SAM~2 cases but only 44\% of SAM~3 cases. Multi-click prompting reduces these failure rates for both models, but SAM~2 still shows substantially higher severe-error frequencies (e.g., 73\% vs.\ 65\% in the smallest bin and 49\% vs.\ 13\% in the 2k--10k bin). Bounding-box prompts largely suppress over-segmentation in both models, with severe over-segmentation falling below $\sim$7\% of cases for SAM~2 and $\sim$4\% of cases for SAM~3 across all size bins. In this strong-prompt regime, the initialization advantage of SAM~3 becomes modest, confirming that SAM~2's instability is primarily a sparse-prompt phenomenon.


\subsection{Full-Volume/Sequence Segmentation Accuracy}
\input{tables/table_fullvolume}
While prompt-frame accuracy captures initialization quality, clinical applications require accurate segmentation across full 3D volumes or complete temporal sequences. Full-volume DSC, therefore, reflects the combined effect of both initialization and propagation under SAM~2's memory-based architecture and SAM~3's redesigned tracking pathway. Table~\ref{tab:sam2_vs_sam3_allprompts} summarizes structure-wise performance across all prompting regimes.

Across modalities, a consistent pattern emerges. Under sparse guidance (single- and multi-click prompting), SAM~3 generally achieves higher full-volume DSC than SAM~2 for most targets, indicating that its stronger prompt-frame initialization translates into better sequence-level performance, especially for structures such as vessels, gastrointestinal segments, and cardiac chambers. As prompt strength increases to bounding boxes and masks, this global advantage narrows: both models approach similar accuracy for many large, well-contrasted organs, and performance instead splits by anatomical type.

In this stronger-prompt regime, SAM~2 is frequently more competitive or superior for several organs such as kidneys, spleen, and bladder, reflecting more conservative propagation once a reliable initialization is provided. In contrast, SAM~3 retains an advantage for low-contrast, highly deformable, or tubular anatomy, where tracking stability is more challenging. Representative failure cases, such as MR bladder and SegThy thyroid/vascular targets, illustrate that excellent prompt-frame DSC can still collapse to near-zero full-volume DSC for one model while the other maintains stable masks. Given SegThy's click-prompt breakdown and counterintuitive propagation behavior for both models under stronger prompts, we provide a deeper dataset-specific analysis of the failure patterns and why full-volume DSC can remain non-trivial despite near-zero prompt-frame DSC in Appendix~\ref{sec:appendix_segthy}. These discrepancies foreshadow the retention and over-propagation behaviour quantified by the failure-mode analysis.

\subsection{Failure-Mode Analysis: Temporal Retention and Over-Propagation}
\label{subsec:failure_mode_results}
Volume-level DSC aggregates initialization and propagation into a single number, but interactive workflows care about how masks evolve over time. Here we examine two temporal failure modes defined in Section~\ref{sec:failure_methods}: (i) \emph{retention}, i.e., how quickly a well-initialized mask drifts or degrades while the object is still present, and (ii) \emph{over-propagation}, i.e., how long a model continues to hallucinate a mask after the object has disappeared.
\input{figures/figure_retention}
\input{figures/figure_overprop}
\paragraph{Retention of well-initialized objects.}
To isolate propagation behaviour from pure initialization failures, we restrict this analysis to cases with good starting masks (prompt-frame $DSC \ge 0.7$), so that the decay slopes primarily reflect how well each model maintains a reasonable segmentation rather than how quickly an already-bad mask collapses. For completeness, we also report the same analysis computed over all cases (including poor initializations with prompt-frame $DSC<0.7$) in Appendix~\ref{sec:appendix_retention_alldata} (Figure~\ref{fig:retention_alldata}).

Figure~\ref{fig:retention} summarizes retention for cases with good initialization (prompt-frame $DSC \ge 0.7$). The boxen plots show the distribution of normalized decay slopes $\beta$ for each prompt type, where more negative values correspond to faster loss of accuracy from the first to the last frame. The bar plot reports mean slopes by prompt type, and the ECDF curves show, for any threshold on $\beta$, what fraction of cases have decay no worse than that value; the annotated $P_{50}$ markers indicate the median slope (half of the cases decay faster, half more slowly). 

Across all well-initialized cases, both models exhibit negative normalized decay slopes on average, indicating that segmentation quality tends to deteriorate as the object evolves (Figure~\ref{fig:retention}). However, SAM~3 consistently forgets more slowly. Under single-click prompts, the mean decay slope is $-0.215$ for SAM~2 versus $-0.134$ for SAM~3, and the median slopes ($P_{50}$) are $-0.096$ and $-0.054$, respectively, implying that a typical SAM~2 sequence loses roughly twice as much DSC over its lifespan as a typical SAM~3 sequence. Multi-click prompts show a similar pattern (mean slopes $-0.233$ vs.\ $-0.139$; medians $-0.111$ vs.\ $-0.045$). The difference widens with stronger prompts: for bounding boxes, mean slopes are $-0.286$ (SAM~2) and $-0.173$ (SAM~3), with medians of $-0.161$ and $-0.075$; for masks, the means are $-0.320$ vs.\ $-0.206$ and medians $-0.192$ vs.\ $-0.114$. In the ECDFs, SAM~3's curves are consistently shifted toward less negative values, indicating that, conditional on a good start, SAM~3 maintains segmentation quality better across all prompt types.

\paragraph{Over-propagation after object disappearance.}
Figure~\ref{fig:overprop} reveals the cost of SAM~3's retention stability: a tendency to be ``sticky.'' The distributions of over-propagation length show how many frames each model continues to predict foreground after the last ground-truth frame. The stacked bars group volumes into none/minor/moderate/severe hallucination (0, 1--10, 11--50, $>50$ frames), and the ECDF curves describe the cumulative distribution of hallucinated length; the annotated $P_{90}$ gives the number of frames below which 90\% of cases fall. Across prompt settings, most cases show short or zero over-propagation, but a small fraction of failures persist for much longer, motivating tail-focused summaries in addition to $P_{90}$.

Under single-click (1,0) prompts, both models behave similarly: about 35\% of SAM~2 volumes and 33\% of SAM~3 volumes terminate perfectly with zero over-propagation, and the $P_{90}$ values are comparable (79 vs.\ 76 frames). This similarity also holds in central tendency (mean/median $L=27.8/5$ for SAM~2 vs.\ $26.9/7$ for SAM~3), while rare long failures remain (approximately $P_{99}=243$ vs.\ $227$ frames; $\sim$7.7\% vs.\ $\sim$7.1\% of cases exceed 100 over-propagated frames). The maximum observed failure is $L=533$ frames, occurring in the CholecSeg8K dataset (Cystic Duct) under single-click prompting, and is observed for both models on the same case. With multi-click prompts, SAM~2 becomes slightly more conservative, with about 43\% of volumes showing no over-propagation compared with about 30\% for SAM~3, and $P_{90}$ dropping to 62 frames for SAM~2 versus 78 frames for SAM~3. Consistent with this shift, the bulk of the distribution tightens for SAM~2 (mean/median $20.5/2$; $P_{99}\approx215$; $\sim$5.0\% $>100$ frames), while SAM~3 retains a heavier tail (mean/median $27.7/8$; $P_{99}\approx228$; $\sim$7.4\% $>100$ frames).

The contrast is sharper once strong visual prompts are provided. For bounding-box prompts, 54\% of SAM~2 volumes exhibit no over-propagation, compared with only 34\% for SAM~3, and severe tails of more than 50 hallucinated frames occur in 6.5\% of SAM~2 cases but 15.3\% of SAM~3 cases; the corresponding $P_{90}$ values are 34 vs.\ 72 frames. Within the $>50$ ``severe'' category, long failures remain more frequent for SAM~3 under bounding-box prompting: $\sim$1.74\% (SAM~2) vs.\ $\sim$6.56\% (SAM~3) exceed 100 over-propagated frames, with $P_{99}\approx139$ vs.\ $\approx212$ frames. Mask prompting shows a similar trend: roughly 54\% of SAM~2 volumes versus 34\% of SAM~3 volumes have zero over-propagation, while severe tails appear in 7.2\% vs.\ 15.5\% of cases and $P_{90}$ increases from 37 frames (SAM~2) to 73 frames (SAM~3). In this case, $\sim$2.11\% (SAM~2) vs.\ $\sim$6.56\% (SAM~3) exceed 100 frames, with $P_{99}\approx158$ vs.\ $\approx218$ frames.

Taken together with the prompt-frame over-segmentation analysis, these failure-mode results highlight a complementary trade-off between the models. SAM~3 offers more reliable initialization and better retention for well-initialized objects, particularly under stronger prompts, but is more ``sticky'' and prone to long-lived hallucinated masks after the object disappears. SAM~2 is less capable under sparse prompts and struggles with most targets, yet it tends to terminate tracks earlier and exhibits fewer extreme over-propagation failures under bounding-box and mask prompting.

\subsection{Performance Behavior as a Function of Prompt Strength}
Across modalities, both models follow a consistent pattern as prompt strength increases from single-click to multi-click, bounding-box, and mask prompts. Under sparse guidance (clicks), SAM~3 dominates because it interprets minimal prompts more reliably, leading to higher prompt-frame DSC, fewer prompt-frame over-segmentation failures, and better temporal retention. As prompts become more informative and provide explicit spatial support, the global advantage narrows and performance becomes modality- and structure-dependent: SAM~2 is often competitive or better on targets such as kidneys, spleen, and bladder under bounding-box/mask prompting in CT/MRI, whereas SAM~3 more consistently leads on several vessel and tract targets (e.g., aorta/IVC/portal-venous structures and GI tract) and shows strong gains on challenging ultrasound targets (e.g., SegThy thyroid). Qualitative examples illustrating these regimes are shown in Figures~\ref{fig:sam3better} and~\ref{fig:sam2better}.

\input{figures/figure_sam3better}
\input{figures/figure_sam2better}

\subsection{Overall Interpretation and Summary of Findings}
Taken together, the prompt-frame, full-volume, and failure-mode evaluations show that SAM~2 and SAM~3 offer complementary strengths rather than a single performance hierarchy, driven by a trade-off between prompt interpretation (what to segment) and temporal consistency (how well it is remembered). We summarize our findings as follows:

\begin{itemize}
    \item \textbf{Initialization advantages for SAM~3.}  
    Under click prompts, SAM~3 has a clear advantage: its unified Perception Encoder infers structure from minimal input, yielding higher prompt-frame DSC and substantially fewer flooding failures than SAM~2 across most targets.

    \item \textbf{Propagation trade-offs under strong visual prompts.}
    Under bounding-box or mask prompts, SAM~2 is often competitive or better on kidneys, spleen, and bladder in CT/MRI, and it more frequently terminates tracks without long over-propagation failures. SAM~3 remains stronger on several vessel/tract targets (e.g., aorta/IVC/portal-venous structures and GI tract) and on challenging ultrasound targets such as SegThy thyroid, but it more often exhibits longer over-propagation tails.

    \item \textbf{The ``unreliable propagator'' risk.} 
    High initialization accuracy does not guarantee successful propagation. In several datasets (e.g., MR bladder, SegThy ultrasound), one model attains excellent prompt-frame DSC but then collapses or hallucinates for many frames. This highlights the need to evaluate temporal retention and over-propagation beyond prompt-frame or volume-averaged DSC.
\end{itemize}

Overall, SAM~3 is the natural default for interactive zero-shot segmentation on 3D medical data. Across modalities, it is markedly more reliable under sparse click prompts due to stronger prompt interpretation and more stable retention, and these advantages frequently translate into better full-volume performance. As the visual prompt becomes stronger (bounding box or mask), the performance gap narrows for many targets; however, SAM~3 remains competitive in this regime and continues to lead on a broad set of structures. The main exceptions are a smaller subset of targets under bounding-box or mask initialization where SAM~2 achieves higher DSC and behaves more conservatively, with fewer long over-propagation tails.


\section{Conclusion}
This work presents the first large-scale, controlled comparison of SAM~2 and SAM~3 for zero-shot segmentation of 3D medical data under identical visual prompting. By evaluating single-click, multi-click, bounding-box, and mask initialization across 16 datasets and 54 anatomical structures, we disentangle how architectural changes in SAM~3 affect prompt interpretation, temporal retention, and failure behavior relative to SAM~2.

Our results show that SAM~3 offers markedly stronger prompt interpretation: it delivers higher prompt-frame DSC, substantially fewer over-segmentation failures, and slower temporal decay of the predicted mask for well-initialized objects, especially under click and bounding-box prompts. SAM~2, however, remains a competitive and often preferable choice for some organs, such as the kidney, gallbladder, and spleen in CT/MRI, under strong visual prompts, where its propagation is more conservative and less prone to long-lived hallucinated masks. The failure-mode analysis highlights that high initialization accuracy alone is not sufficient: models can still suffer catastrophic collapse or prolonged over-propagation, underscoring the need to explicitly evaluate temporal retention and termination behavior.

Overall, our findings position SAM~3 as the stronger default backbone for broad 3D medical segmentation workflows, while clarifying scenarios in which SAM~2 remains the safer propagator for specific organ types and prompt regimes. A key limitation of this study is that we restrict the comparison to purely visual prompts, deliberately disabling the concept- and text-based mechanisms introduced in SAM~3. As vision--language approaches such as VoxTell~\cite{rokuss2025voxtell} gain traction for open-vocabulary 3D medical segmentation, extending our framework to include semantic prompting and language-guided concepts, with SAM~3's full capabilities enabled, is an important direction for future work.


\bibliography{midl26_26}

\clearpage
\appendix

\input{appendix/appdx_supervised_baselines}
\input{appendix/appdx_sam3_pvs_details}
\input{appendix/appdx_sam2_vs_sam3}
\input{appendix/appdx_random_click}
\input{appendix/appdx_preproc_details}
\input{appendix/appdx_prompt-frame_results}
\input{appendix/appdx_segthy_failure}
\input{appendix/appdx_retention_on_all_data} 

\end{document}
