SAM2Act: Integrating Visual Foundation Model with A Memory Architecture for Robotic Manipulation

Published: 16 Sept 2025 · Last Modified: 16 Sept 2025 · CoRL 2025 Spotlight · CC BY 4.0
Keywords: Robotic Manipulation, Multi-View Robotic Transformer, Imitation Learning, Memory-Based Architecture, Behavior Cloning, Generalization
TL;DR: A memory-based multi-view robotic transformer for manipulation
Abstract: Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and in addressing memory-dependent tasks. To bridge this gap, we introduce **SAM2Act**, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of **86.8% across 18 tasks** on the RLBench benchmark and demonstrates robust generalization on *The Colosseum* benchmark, with only a **4.3% performance gap** under diverse environmental perturbations. Building on this foundation, we propose **SAM2Act+**, a memory-based architecture inspired by SAM2 that incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need to evaluate memory-dependent tasks, we introduce ***MemoryBench***, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves an average success rate of **94.3% on memory-based tasks** in *MemoryBench*, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems.
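To make the memory mechanism named in the abstract (a memory bank, a memory encoder, and memory attention) more concrete, here is a minimal sketch. It is *not* the authors' implementation: the `MemoryBank` and `MemoryAttention` classes, the FIFO capacity, the token shapes, and all hyperparameters below are hypothetical, chosen only to illustrate how current-observation features might cross-attend to encoded past observations before action prediction.

```python
# Illustrative sketch only -- not the SAM2Act+ implementation.
# Assumes a SAM2-style read-then-write loop: encode each observation,
# store it in a fixed-capacity memory bank, and let the current
# observation tokens cross-attend to the stored memory tokens.
import torch
import torch.nn as nn


class MemoryBank:
    """Fixed-capacity FIFO store of encoded past observation features."""

    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self.entries = []  # each entry: (B, N, C) feature tokens

    def add(self, feat: torch.Tensor) -> None:
        self.entries.append(feat.detach())
        if len(self.entries) > self.capacity:
            self.entries.pop(0)  # drop the oldest observation

    def read(self):
        # (B, T * N, C) concatenation of stored tokens, or None if empty.
        return torch.cat(self.entries, dim=1) if self.entries else None


class MemoryAttention(nn.Module):
    """Current-observation tokens cross-attend to memory tokens."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)  # stand-in memory encoder
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, obs_tokens: torch.Tensor, bank: MemoryBank) -> torch.Tensor:
        memory = bank.read()
        if memory is None:
            fused = obs_tokens  # no history yet: pass through
        else:
            attended, _ = self.attn(obs_tokens, memory, memory)
            fused = self.norm(obs_tokens + attended)
        bank.add(self.encoder(fused))  # write the new observation back
        return fused  # memory-conditioned tokens for the policy head


# Usage with hypothetical shapes: batch=1, 64 tokens, dim=256.
bank = MemoryBank(capacity=8)
module = MemoryAttention(dim=256, heads=8)
tokens = torch.randn(1, 64, 256)
out = module(tokens, bank)
print(out.shape)  # torch.Size([1, 64, 256])
```

The read-then-write loop here only mirrors, at a high level, the memory bank / encoder / attention pipeline described in the abstract; the actual SAM2Act+ architecture may differ in every detail.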
Supplementary Material: zip
Submission Number: 4