Memory-Enhanced Temporal Learning: Leveraging SAM2’s Memory Modules for Consistent Video Segmentation on Surgical Video

Published: 21 Jul 2025 · Last Modified: 20 Aug 2025
Venue: MSB EMERGE 2025 (decision: conditional, requires major revision)
License: CC BY 4.0
Keywords: Video Segmentation, Endoscopic Video, Temporal Consistency
TL;DR: We propose TMAM, a module that transfers SAM2's memory mechanism to existing 2D segmentation models, enabling them to produce smoother, more consistent video segmentations without requiring point prompts.
Abstract: Video segmentation is critical for many medical imaging applications; however, most mainstream segmentation models process each frame independently, often resulting in inconsistent segmentation masks across consecutive frames. Existing video segmentation models incorporate temporal cues but require densely annotated, large-scale datasets. Although the recently proposed Segment Anything Model 2 (SAM2) has demonstrated promising segmentation capabilities with its memory mechanism, applying SAM2 in clinical settings is challenging due to its reliance on user prompts. To address these issues, we introduce the Temporal Memory Augmentation Module (TMAM). TMAM adapts any pre-trained 2D segmentation model by encoding past-frame predictions via SAM2’s memory encoder and applying memory attention to refine current-frame features. By leveraging temporal redundancy in video sequences, TMAM captures contextual cues that may be overlooked by single-frame processing, thereby improving robustness to occlusions and boundary artifacts. Experiments on public surgical video datasets demonstrate that TMAM enhances Dice scores and temporal consistency across various base architectures. These results highlight TMAM’s ability to produce smoother, more coherent segmentations, paving the way for more reliable video analysis in surgical image navigation systems and robotic surgery, where precise and consistent segmentation is essential.
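The abstract describes TMAM's core loop: encode past-frame predictions into memory, then cross-attend current-frame features to that memory before the segmentation head runs. The sketch below illustrates this idea in PyTorch under stated assumptions; the module structure, dimensions, and fusion scheme are illustrative stand-ins for SAM2's memory encoder and memory attention, not the authors' implementation.

```python
# Minimal sketch of the TMAM idea, assuming a PyTorch setting. All names,
# shapes, and hyperparameters are assumptions for illustration: a small
# "memory encoder" compresses past-frame predictions into tokens, and
# cross-attention ("memory attention") refines the current frame's
# features before the base 2D model's segmentation head.
import torch
import torch.nn as nn


class TemporalMemoryAugmentationModule(nn.Module):
    """Hypothetical TMAM wrapper: cross-attends current-frame features
    to a bank of encoded past-frame predictions."""

    def __init__(self, feat_dim: int = 256, mem_dim: int = 64, num_heads: int = 8):
        super().__init__()
        # Stand-in for SAM2's memory encoder: fuses a past frame's
        # features with its predicted mask into compact memory tokens.
        self.memory_encoder = nn.Sequential(
            nn.Conv2d(feat_dim + 1, mem_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(mem_dim, mem_dim, kernel_size=3, stride=2, padding=1),
        )
        self.mem_proj = nn.Linear(mem_dim, feat_dim)
        # Stand-in for SAM2's memory attention: queries = current-frame
        # features, keys/values = memory tokens from previous frames.
        self.memory_attention = nn.MultiheadAttention(
            feat_dim, num_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(feat_dim)

    def encode_memory(self, feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """Encode one past frame's features (B, C, H, W) and predicted
        mask (B, 1, H, W) into memory tokens (B, N, C)."""
        mem = self.memory_encoder(torch.cat([feats, mask], dim=1))
        tokens = mem.flatten(2).transpose(1, 2)  # (B, N, mem_dim)
        return self.mem_proj(tokens)

    def forward(self, cur_feats: torch.Tensor, memory_bank: list) -> torch.Tensor:
        """Refine current-frame features (B, C, H, W) with stored memories."""
        if not memory_bank:
            return cur_feats  # first frame: nothing to attend to
        b, c, h, w = cur_feats.shape
        queries = cur_feats.flatten(2).transpose(1, 2)  # (B, HW, C)
        memories = torch.cat(memory_bank, dim=1)        # (B, sum N_t, C)
        attended, _ = self.memory_attention(queries, memories, memories)
        refined = self.norm(queries + attended)          # residual fusion
        return refined.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # Toy usage over a short clip: a placeholder 2D backbone produces
    # per-frame features; TMAM refines them with past-frame memories,
    # then the new prediction is itself encoded into the memory bank.
    tmam = TemporalMemoryAugmentationModule(feat_dim=256)
    backbone = nn.Conv2d(3, 256, 3, padding=1)   # placeholder 2D encoder
    seg_head = nn.Conv2d(256, 1, 1)              # placeholder mask head
    memory_bank = []
    for _ in range(3):  # three consecutive frames
        frame = torch.randn(1, 3, 64, 64)
        feats = backbone(frame)
        feats = tmam(feats, memory_bank)
        mask = seg_head(feats).sigmoid()
        memory_bank.append(tmam.encode_memory(feats, mask.detach()))
```

Note the residual fusion (`queries + attended`): on the first frame the module degrades gracefully to the base model's single-frame behavior, which matches the abstract's claim that TMAM adapts a pre-trained 2D model rather than replacing it.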
Camera Ready Submission: zip
Submission Number: 6