\section{Failure analysis of the SegThy (3D ultrasound) dataset}
\label{sec:appendix_segthy}

SegThy is a 3D ultrasound sweep dataset and is uniquely challenging because it combines two sources of difficulty that are usually separated across modalities. In ``volume-like'' medical data (e.g., CT and MR), targets may first appear as tiny cross-sections near the boundary of a scan on the prompt-frame, but stable intensity statistics and sharper boundaries typically make sparse prompting tractable once the object emerges. In ``sequence-like'' medical data (e.g., CAMUS cine ultrasound), by contrast, appearance is dominated by speckle and weak edges, but targets are typically temporally continuous and occupy substantial area on the prompt-frame and throughout the sequence, providing immediate spatial support for initialization as well as continued support for propagation. SegThy inherits both the prompt-frame ambiguity of volume-like data, where the target first appears as a tiny cross-section, and the speckle-dominated, low-contrast appearance of ultrasound, so prompts must resolve the object when it is both visually ambiguous and minimally represented in pixels.

\begin{figure}[!htbp]
    \centering
    \includegraphics[width=\linewidth]{figures/segthy_failure.pdf}
    \caption{\textbf{SegThy click-prompt failures are driven by extreme small-target prompt-frames.}
    Two example cases under a single-click (1,0) prompt are shown. In Example 1, SAM~2 exhibits prompt-frame flooding and remains poorly localized across the volume (near-zero DSC), while SAM~3 is initially weak but recovers once the target occupies a larger cross-section, achieving high overlap on mid-volume slices. In Example 2, both models have low prompt-frame DSC at first appearance but track accurately on subsequent slices once sufficient pixel support emerges, illustrating that the dominant brittleness is concentrated at initialization on the prompt-frame.
    [Colors: \sethlcolor{gtMask}\hl{\texttt{GT}}, \sethlcolor{samTwoBase}\hl{\texttt{SAM 2}}, \sethlcolor{samThreeBase}\hl{\texttt{SAM 3}}]}
    \label{fig:segthyfailure}
\end{figure}

The dominant driver of click-prompt failure on SegThy is the extremely small size of the ground-truth target on the prompt-frame. Under our protocol, the prompt-frame is the first slice where the structure appears, and in SegThy this first-appearance cross-section is often only a tiny fraction of the structure's typical extent in the same volume. Quantitatively, the ground-truth area on the prompt-frame is only $0.013\times$ the mean ground-truth area over the remaining GT-present slices in that volume, i.e., the target is typically $\sim 77\times$ smaller on the prompt-frame than on a typical slice later in the volume. In $90.5\%$ of SegThy cases, the prompt-frame ground-truth area lies within the smallest $1\%$ of GT-present slices in the volume. In comparison, for sequence-like datasets such as CAMUS cine ultrasound, this effect is not prevalent: the prompt-frame target size is comparable to the sequence average (median ratio $\approx 1.02$).
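
As a sketch of how this ratio can be formalized (the symbols $A_t$, $p$, and $\mathcal{S}$ are introduced here only for illustration), let $A_t$ denote the ground-truth area in pixels on slice $t$, $p$ the prompt-frame, and $\mathcal{S}$ the set of remaining GT-present slices in the volume. The prompt-frame area ratio is then
\[
r = \frac{A_p}{\frac{1}{|\mathcal{S}|}\sum_{t \in \mathcal{S}} A_t},
\]
so that $r \approx 0.013$ corresponds to $1/r \approx 77$, whereas $r \approx 1$ indicates a prompt-frame target comparable in size to the rest of the sequence.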

This extreme small-target regime impacts click prompting in two ways. First, segmentation is genuinely difficult on the prompt-frame: when the structure occupies only a few dozen pixels and is embedded in speckle-dominated texture with weak edges, sparse click prompts ((1,0) and (1,2)) often provide insufficient spatial evidence to disambiguate the target from surrounding tissue, so failures frequently begin at initialization. Second, small structures make DSC intrinsically unforgiving on the prompt-frame: even modest over-segmentation or a slight spatial offset can drive overlap close to zero when the ground truth is tiny. Consistent with this, prompt-frame predictions in SegThy are frequently much larger than the ground truth (median predicted-to-GT pixel-count ratio $\approx 581\times$), and prompt-frame DSC under clicks is near zero: $70.5\%$ of cases have prompt-frame DSC $<0.01$.
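
The sensitivity of DSC to over-segmentation can be made explicit with a simple bound (a sketch; $G$, $P$, and $k$ are notation introduced here for illustration). Let $G$ and $P$ be the ground-truth and predicted pixel sets on the prompt-frame, with $|P| = k\,|G|$. Since $|P \cap G| \le |G|$,
\[
\mathrm{DSC} = \frac{2\,|P \cap G|}{|P| + |G|} \le \frac{2\,|G|}{k\,|G| + |G|} = \frac{2}{1 + k}.
\]
At the median ratio reported above, $k \approx 581$, this gives $\mathrm{DSC} \le 2/582 \approx 0.0034$ even if the prediction fully covers the target; more generally, any $k > 199$ forces $\mathrm{DSC} < 0.01$.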

A counterintuitive pattern in SegThy is that full-volume DSC can be noticeably higher than prompt-frame DSC even when the initial slice is essentially missed. As the volume progresses, the target typically grows in cross-sectional area and becomes more separable from the background, so mid-to-late slices can contribute substantially more to the sequence-average DSC than the earliest first-appearance slices. Because full-volume DSC averages across hundreds of frames, later slices with larger targets can dominate the aggregate even when prompt-frame overlap is negligible. Figure~\ref{fig:segthyfailure} shows two representative SegThy volumes in which the target is extremely small on the prompt-frame, leading to near-zero prompt-frame DSC under click initialization, followed by recovery once the target grows in later slices.
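
This averaging effect can be illustrated under the assumption that full-volume DSC is the unweighted mean of per-slice DSC (the notation below is introduced only for illustration). For a volume with $T$ evaluated slices, of which the first $m$ have zero overlap and the remaining $T-m$ have mean per-slice DSC $\bar{d}$,
\[
\mathrm{DSC}_{\mathrm{vol}} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{DSC}_t = \frac{T-m}{T}\,\bar{d},
\]
so the early failures lower the aggregate by only a fraction $m/T$ of $\bar{d}$. With $T$ in the hundreds and $m$ small, $\mathrm{DSC}_{\mathrm{vol}}$ stays close to $\bar{d}$ even though prompt-frame DSC is zero.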

SegThy is also challenging due to its long temporal extent (Table~\ref{tab:dataset-description}), which increases the opportunity for drift and error accumulation under persistent speckle and weak edges. Stronger prompts (bounding boxes or masks) largely remove the low-evidence initialization failure on the prompt-frame, but propagation through hundreds of slices can still collapse after a good start. In our runs, this long-horizon degradation is most visible for SAM~2, whereas SAM~3 is generally more resilient and maintains higher sequence-level DSC under strong prompts. Overall, SegThy represents a modality--geometry corner case in which click prompts are destabilized by extremely small targets on the prompt-frame, and long-horizon tracking remains difficult even with accurate initialization.
