\section{SAM~2 vs.\ SAM~3 under visual prompting}
\label{sec:appendix_sam2_sam3_arch}
This appendix summarizes the design differences between SAM~2 and SAM~3 that are most relevant to the PVS regime studied in this paper, in which the same visual-prompt protocol is applied to both models and performance is assessed through long-horizon propagation over 3D medical volumes and cines. The behaviors quantified in the main text are governed by two coupled factors: (i) the learned per-frame visual representation available to the tracking stack, which shapes how sparse prompt evidence is resolved on the initialization frame; and (ii) the temporal conditioning dynamics induced by the propagation/memory formulation, which shape retention, drift, and termination after the target disappears.

In SAM~2, per-frame representations are produced by a Hiera backbone and propagation is implemented as memory-conditioned decoding over a streaming memory bank: a memory encoder writes features/masks into memory and a memory-attention module conditions current-frame decoding on this bank during propagation~\cite{ravi2024sam,ryali2023hiera}. 
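
As a notational sketch of this loop (the symbols below are introduced here purely for illustration and are not taken from the SAM~2 implementation), let $x_t$ denote frame $t$, $E$ the image encoder, $A$ the memory-attention module, $D$ the mask decoder, $E_{\mathrm{mem}}$ the memory encoder, $p_t$ the visual prompt on frame $t$ (with $p_t = \varnothing$ on unprompted frames), and $\mathcal{M}_t$ the memory bank after frame $t$. Under these assumptions, one propagation step can be written as
\[
F_t = E(x_t), \qquad \tilde{F}_t = A\bigl(F_t, \mathcal{M}_{t-1}\bigr), \qquad \hat{m}_t = D\bigl(\tilde{F}_t, p_t\bigr),
\]
\[
\mathcal{M}_t = U\bigl(\mathcal{M}_{t-1},\, E_{\mathrm{mem}}(F_t, \hat{m}_t)\bigr),
\]
where $U$ is a hypothetical symbol for the bank update rule, for example one that retains prompted frames together with a bounded queue of recent frames. Written this way, the predicted mask $\hat{m}_t$ both serves as the output for frame $t$ and enters the state that conditions every later frame, which is why an error made on the prompt frame can persist or compound during propagation.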

In SAM~3, the tracker consumes Perception Encoder (PE) features from a backbone that, in the official design, is shared between the detector and the tracker. The overall system adopts a detector--tracker factorization: a DETR-style detector with learned object queries provides object-centric localization and association, while the tracker inherits the SAM~2 transformer encoder--decoder for video segmentation and interactive refinement~\cite{carion2025sam3segmentconcepts,carion2020end}. Importantly, in the PVS setting the SAM~3 tracker retains a SAM~2-style propagation mechanism with a memory encoder and memory bank~\cite{carion2025sam3segmentconcepts}. Even though both models employ memory banks, their effective temporal update dynamics can differ for two reasons: the tracker operates on different learned features (Hiera vs.\ PE), and the detector--tracker decomposition introduces stronger object-centric identity and association priors that couple to how state is written, retrieved, and maintained over time. Consequently, the balance between prompt evidence and model priors on the initialization frame, as well as long-horizon stability and termination behavior, can differ between SAM~2 and SAM~3 under identical visual prompting.

These architectural differences relate to the three error modes quantified in the main text. Prompt-frame over-segmentation is most sensitive to the per-frame feature space and the initialization pathway, since sparse prompts must be resolved into a precise object extent on the prompt frame. Temporal retention is most sensitive to temporal conditioning and memory updates while the target remains present, since small initialization errors may be reinforced or corrected depending on what is written to memory and how strongly it is reused during propagation. Over-propagation after object disappearance is most sensitive to termination behavior and to the persistence of tracker/memory state after visual evidence vanishes; here, differences in identity persistence induced by the tracker formulation can translate into different stopping characteristics. Because the visual prompting protocol is held constant across models in our study, differences in model behavior are most naturally interpreted as consequences of differences in learned visual representations, initialization pathway, and long-horizon propagation/termination dynamics between SAM~2 and SAM~3.
