\section{Retention on all data}
\label{sec:appendix_retention_alldata}

In the main text (Section~\ref{subsec:failure_mode_results}), we focused on retention behavior for the subset of cases with good initialization ($DSC \ge 0.7$ on the prompt frame). For completeness, Figure~\ref{fig:retention_alldata} extends the same analysis to all volume--object pairs, including those with poor initialization. As expected, the distributions of normalized decay slopes broaden for both models, particularly under click prompting, where failures at the first frame lead to rapid apparent decay. Under single-click (1,0) prompts, the mean slopes of SAM~2 and SAM~3 are nearly identical (approximately $-0.071$ vs.\ $-0.073$), and the median slopes are much closer to zero ($-0.006$ vs.\ $-0.028$). This gap between mean and median indicates that most of the degradation is driven by a minority of highly unstable cases. For multi-click prompting, SAM~3 retains an advantage (SAM~2 vs.\ SAM~3 mean slopes of $-0.130$ vs.\ $-0.085$; medians of $-0.059$ vs.\ $-0.030$). The differences become more pronounced under bounding-box and mask prompts: SAM~2 exhibits substantially more negative mean slopes (about $-0.262$ and $-0.309$, respectively) than SAM~3 (about $-0.152$ and $-0.201$), and the ECDFs show that SAM~3's decay distributions are consistently shifted toward less negative values. Overall, when poor initializations are included, SAM~3 continues to forget more slowly than SAM~2 once a sufficiently strong spatial prompt (multi-click, bounding box, or mask) is provided.
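
As a reading aid for the ECDF panels, and using notation introduced here only for illustration, let $s_1, \dots, s_n$ denote the normalized decay slopes of one model under one prompt type. The empirical CDF is then
\[
    \hat{F}(s) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\left[ s_i \le s \right],
\]
that is, the fraction of volume--object pairs whose slope is at most $s$. A distribution shifted toward less negative slopes appears as a curve lying to the right of, or equivalently at or below, the other model's curve: $\hat{F}_{\mathrm{SAM\,3}}(s) \le \hat{F}_{\mathrm{SAM\,2}}(s)$ over the range of $s$ where the shift holds.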

\begin{figure}[!htbp]
    \centering
    \includegraphics[width=0.95\linewidth]{figures/retention_figure_alldata.pdf}
    \caption{\textbf{Retention decay analysis on all cases.}
    Same layout as Figure~\ref{fig:retention}, but computed over all volume--object pairs, including those with poor initialization ($DSC < 0.7$).
    The distributions broaden for both models, especially under click prompting, yet SAM~3 generally maintains less negative mean decay slopes for multi-click, bounding-box, and mask prompts.}
    \label{fig:retention_alldata}
\end{figure}
