\section{SAM~3 PVS configuration details and implications}
\label{sec:appendix_sam3_disable_concepts}

This appendix documents the SAM~3 configuration used in our experiments and clarifies how it follows the usage described in the SAM~3 paper. SAM~3 distinguishes Promptable Visual Segmentation (PVS), where targets are specified by visual prompts (points, boxes, masks), from concept-driven prompting mechanisms that use semantic inputs~\cite{carion2025sam3segmentconcepts}. Our evaluation uses the PVS setting so that SAM~2 and SAM~3 are compared under the same interaction pattern: a first-frame visual prompt followed by forward propagation through the sequence.

We run SAM~3 in the PVS setting through the authors’ released tracker-based visual-prompt interface; that is, we select the corresponding inference path rather than modifying parameters or introducing custom bypasses. Concretely, all SAM~3 inference in this work builds on the official SAM~3 implementation, which exposes a SAM~2-style video task interface\footnote{\url{https://github.com/facebookresearch/sam3/blob/11dec2936de97f2857c1f76b66d982d5a001155d/examples/sam3_for_sam2_video_task_example.ipynb}}. For all SAM~2 inference, we use the official SAM~2 VOS script as the reference\footnote{\url{https://github.com/facebookresearch/sam2/blob/2b90b9f5ceec907a1c18123530e92e794ad901a4/tools/vos_inference.py}}. Under this SAM~3 setup, we provide only visual prompts on the initialization frame and run the released forward propagation loop; concept prompts (text or exemplar inputs) are not provided, and concept/PCS handlers are not invoked. As a result, predictions are determined by SAM~3's Perception Encoder features, the prompt-conditioned initialization on the prompt frame, and the tracker’s temporal propagation and memory updates over subsequent frames.

This choice keeps the comparison controlled and aligned with the intended PVS usage. Both SAM~2 and SAM~3 operate from the same class of visual inputs under the same interaction protocol, so the evaluation isolates differences in visual prompt interpretation and propagation dynamics without introducing additional implementation degrees of freedom. Over-propagation is measured from the masks that this standard PVS propagation produces after the final ground-truth frame. It therefore reflects the termination behavior of the tracker under visual-only interaction, rather than behavior induced by custom parameter masking or auxiliary semantic inputs.
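
As one way to make this measurement concrete, consider the following illustrative formalization. The symbols $T$, $T_{\mathrm{gt}}$, $\hat{M}_t$, and $\mathrm{OP}$ are notation assumed here for exposition only, and the expression is a sketch rather than necessarily the exact metric definition used in the main text. Let a sequence have frames $t = 1, \dots, T$, let $T_{\mathrm{gt}} < T$ be the index of the final ground-truth frame, and let $\hat{M}_t$ be the binary mask predicted at frame $t$. A simple over-propagation rate is then the fraction of post-ground-truth frames on which the tracker still emits a non-empty mask:
\begin{equation}
\mathrm{OP} = \frac{1}{T - T_{\mathrm{gt}}} \sum_{t = T_{\mathrm{gt}} + 1}^{T} \mathbf{1}\left[\, \left| \hat{M}_t \right| > 0 \,\right],
\end{equation}
where $\left| \hat{M}_t \right|$ denotes the number of foreground pixels in $\hat{M}_t$ and $\mathbf{1}[\cdot]$ is the indicator function. Under this sketch, $\mathrm{OP} = 0$ corresponds to a tracker that emits only empty masks after the final ground-truth frame, while $\mathrm{OP} = 1$ corresponds to one that emits a non-empty mask on every remaining frame.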
