\section{Preprocessing Details}
\label{sec:appendix_preproc}

All datasets were converted into a unified slice-based format to enable consistent evaluation across models and modalities. For CT datasets, images were windowed using clinically standard ranges (e.g., a soft-tissue window with level/width of $40/400$~HU for abdominal CT and a lung window of $-600/1500$~HU for thoracic CT), clipped to the window, and rescaled to $[0,255]$. MRI volumes were normalized by clipping intensities to the range between the 0.5th and 99.5th percentiles of each volume and linearly rescaling this clipped range to $[0,255]$, which reduces sensitivity to differences in dynamic range. Ultrasound images were min--max normalized per sequence and similarly rescaled to $[0,255]$. Endoscopy datasets provided color-coded semantic masks, which were converted into per-class binary masks via an RGB-to-class lookup. No smoothing, interpolation, or artifact removal was applied during preprocessing. All inputs were provided at native resolution; resizing to the encoder input size ($1024\times1024$) is handled internally by the official code, and we follow the authors' released inference pipelines to ensure a faithful and reproducible comparison. This standardized preprocessing ensures consistent inputs across modalities.
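
As an illustrative summary, the three intensity normalizations above can be read as a single clip-and-rescale map. The symbols $x$, $a$, $b$, $\ell$, and $w$ are notation introduced here for exposition only; they are not taken from any released pipeline. For an input intensity $x$ and bounds $a<b$,
\[
  x' \;=\; 255 \cdot \frac{\min\bigl(\max(x,\,a),\,b\bigr) - a}{b - a}.
\]
For CT, assuming the usual convention that a window of level $\ell$ and width $w$ is centered on the level, the bounds are $a=\ell-w/2$ and $b=\ell+w/2$. The $40/400$ soft-tissue window then corresponds to $[-160,\,240]$~HU and the $-600/1500$ lung window to $[-1350,\,150]$~HU. For MRI, $a$ and $b$ are the 0.5th and 99.5th intensity percentiles of the volume; for ultrasound, they are the minimum and maximum intensity over the sequence. Whether $x'$ is subsequently rounded to 8-bit integers is not specified above and is left open in this sketch.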

All experiments use publicly released checkpoints without any fine-tuning. For SAM~2, we use the SAM~2.1 Hiera-B+ checkpoint. The SAM~3 paper does not describe multiple model variants, so we adopt the standard configuration provided by the authors. All evaluations were performed on NVIDIA H100 GPUs.