TL;DR: Memory-based Multi-View Robotics Transformer
Abstract: Robotic manipulation systems operating in diverse, dynamic environments must exhibit three critical abilities: multitask interaction, generalization to unseen scenarios, and spatial memory. While significant progress has been made in robotic manipulation, existing approaches often fall short in generalizing to complex environmental variations and in addressing memory-dependent tasks. To bridge this gap, we introduce **SAM2Act**, a multi-view robotic transformer-based policy that leverages multi-resolution upsampling with visual representations from a large-scale foundation model. SAM2Act achieves a state-of-the-art average success rate of **86.8% across 18 tasks** on the RLBench benchmark, and demonstrates robust generalization on *The Colosseum* benchmark, with only a **4.3% performance gap** under diverse environmental perturbations. Building on this foundation, we propose **SAM2Act+**, a memory-based architecture inspired by SAM2 that incorporates a memory bank, an encoder, and an attention mechanism to enhance spatial memory. To address the need for evaluating memory-dependent tasks, we introduce ***MemoryBench***, a novel benchmark designed to assess spatial memory and action recall in robotic manipulation. SAM2Act+ achieves an average success rate of **94.3% on memory-based tasks** in *MemoryBench*, significantly outperforming existing approaches and pushing the boundaries of memory-based robotic systems.
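To make the memory-bank idea concrete, here is a minimal, illustrative sketch (not the authors' implementation): a toy `ObservationMemory` module that encodes past observation features into a small bank and lets the current observation cross-attend over them before acting. All class names, shapes, and hyperparameters below are assumptions made purely for illustration.

```python
# Minimal conceptual sketch (NOT SAM2Act+'s actual code): a toy memory bank that
# stores encoded observation features plus a cross-attention readout that
# conditions the current observation on remembered ones.
import torch
import torch.nn as nn


class ObservationMemory(nn.Module):
    """Toy memory bank + encoder + attention readout, loosely mirroring the
    high-level description in the abstract. Shapes/sizes are illustrative."""

    def __init__(self, feat_dim: int = 256, max_entries: int = 8, num_heads: int = 4):
        super().__init__()
        self.max_entries = max_entries
        self.memory: list[torch.Tensor] = []              # stored (tokens, feat_dim) features
        self.mem_encoder = nn.Linear(feat_dim, feat_dim)  # stand-in "memory encoder"
        self.readout = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    @torch.no_grad()
    def write(self, obs_feats: torch.Tensor) -> None:
        """Encode and store features of the current observation (tokens, feat_dim)."""
        self.memory.append(self.mem_encoder(obs_feats))
        if len(self.memory) > self.max_entries:           # FIFO eviction of the oldest entry
            self.memory.pop(0)

    def read(self, query_feats: torch.Tensor) -> torch.Tensor:
        """Cross-attend current features (B, tokens, feat_dim) over the memory bank."""
        if not self.memory:
            return query_feats                            # nothing remembered yet
        mem = torch.cat(self.memory, dim=0).unsqueeze(0)  # (1, stored_tokens, feat_dim)
        mem = mem.expand(query_feats.size(0), -1, -1)
        attended, _ = self.readout(query_feats, mem, mem)
        return query_feats + attended                     # residual memory-conditioned features


if __name__ == "__main__":
    mem = ObservationMemory()
    for _ in range(3):                                    # simulate three timesteps
        obs = torch.randn(1, 16, 256)                     # 16 tokens of 256-d features
        fused = mem.read(obs)                             # condition on past observations
        mem.write(obs.squeeze(0))                         # then store the current one
    print(fused.shape)                                    # torch.Size([1, 16, 256])
```

In this sketch, the memory-conditioned features would then be fed to a downstream action head; the real system's memory encoding, eviction strategy, and attention design are described in the paper itself.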
Project page: [sam2act.github.io](https://sam2act.github.io/).
Lay Summary: Robots that work in real-world settings need to do three things well:
1. **Handle new situations** they haven’t seen before.
2. **Perform precise tasks reliably**.
3. **Remember where things are** while they work.
Most current robot systems still struggle with these abilities, especially memory.
Our team built a new system called **SAM2Act** that helps robots see their surroundings from several camera angles, understand what they’re looking at, and act accordingly. In tests covering 18 household-style tasks (like stacking blocks or opening a drawer), SAM2Act completed nearly nine out of ten attempts successfully. Even when we changed the lighting, object colors, and other conditions, its performance dropped by only about four percent. We then added a “memory bank” so the robot could store and recall visual snapshots while it works. The upgraded version, **SAM2Act+**, lets the robot remember where objects were a few moments ago, which is crucial for tasks such as picking up an item it set aside earlier.
Because no standard test existed for this kind of memory, we created *MemoryBench*, a new set of challenges that measure how well a robot can remember and act on past observations. SAM2Act+ topped this benchmark, showing that giving robots a working memory can make them far more reliable.
To learn more about our project and view demonstrations, please visit our project page: [sam2act.github.io](https://sam2act.github.io/).
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://github.com/sam2act/sam2act
Primary Area: Applications->Robotics
Keywords: Robot Learning, Behavior Cloning, Imitation Learning, Memory-based Robotics Transformer
Submission Number: 93