MST-SAM: Bridging Multi-View Gaps in SAM2 with Spatiotemporal Bank

19 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: SAM, spatiotemporal bank, multi-camera systems
Abstract: High-quality, instance-level segmentations are crucial for developing multi-view vision-centric systems such as self-driving vehicles and mobile robots, yet acquiring such annotations is prohibitively expensive. While human-in-the-loop labelling paradigms like SAM2 show great promise on monocular videos, adapting them to multi-camera scenarios is hindered by two fundamental flaws: spatially, ignorance of cross-view geometry leads to severe tracking ambiguity; and temporally, rapidly growing memory demands preclude real-time performance. To address these challenges, we propose MST-SAM, a novel streaming framework for robust multi-view instance segmentation and tracking built on a spatio-temporal memory bank. Our method introduces two core components: (1) a Spatio-Positional Augmentation (SPA) module that bridges SAM2's 2D-centric design with 3D scene geometry; it learns a unified positional prior from camera transformations, enabling tokens to reason about their absolute spatial location across different views; and (2) a Memory View Selection (MVS) strategy that prunes the temporal memory bank, significantly reducing the computational overhead of the multi-view system while preserving accuracy. We validate our method on a custom multi-view instance segmentation benchmark that we introduce over the nuScenes and Waymo datasets, where MST-SAM sets a new state of the art and demonstrates strong generalization.
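Neither component's implementation appears on this page, so the sketch below is an illustrative guess rather than the authors' code. `ray_positional_prior` imagines how an SPA-style module might derive a view-consistent positional prior from camera intrinsics and extrinsics, and `select_memory_views` shows the generic top-k pruning pattern an MVS-style strategy could apply to a memory bank; every function and parameter name here is hypothetical.

```python
import torch

def ray_positional_prior(K, T_cam2ego, H, W, num_freqs=4):
    """Hypothetical SPA-style prior: encode each token's viewing ray in a
    shared ego frame so tokens from different cameras receive geometry-aware,
    mutually comparable positional features.

    K: (3, 3) camera intrinsics; T_cam2ego: (4, 4) camera-to-ego transform.
    Returns an (H, W, 6 * num_freqs) positional-prior grid.
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    # Pixel centres in homogeneous coordinates, shape (H, W, 3).
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    rays_cam = pix @ torch.linalg.inv(K).T      # back-project to camera-frame rays
    rays_ego = rays_cam @ T_cam2ego[:3, :3].T   # rotate into the shared ego frame
    rays_ego = rays_ego / rays_ego.norm(dim=-1, keepdim=True)

    # Fourier-encode the shared-frame ray directions into token-wise features.
    freqs = (2.0 ** torch.arange(num_freqs, dtype=torch.float32)) * torch.pi
    ang = rays_ego.unsqueeze(-1) * freqs        # (H, W, 3, num_freqs)
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

def select_memory_views(memories, scores, budget):
    """Hypothetical MVS-style pruning: keep only the `budget` highest-scoring
    memory entries so the bank stays bounded as views and frames accumulate."""
    keep = torch.topk(scores, k=min(budget, scores.numel())).indices
    return [memories[i] for i in sorted(keep.tolist())]
```

Encoding ray directions in a common ego frame gives tokens from different cameras comparable coordinates, which is one plausible reading of "a unified positional prior from camera transformations"; the actual SPA and MVS designs are specified only in the withdrawn submission itself.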
Supplementary Material: zip
Primary Area: applications to robotics, autonomy, planning
Submission Number: 15424