Keywords: Imitation learning, Short-term memory
Abstract: Many robotic tasks demand short-term memory, whether retrieving objects that are no longer visible or turning off an appliance after a set amount of time. Yet most visuomotor policies remain myopic, relying only on immediate sensory input without leveraging past experience to guide decisions. We present PRISM, a transformer-based architecture that equips visuomotor policies with effective short-term memory via two key components: (i) gated attention, which selectively filters retrieved information to suppress irrelevant details, and (ii) a hierarchical architecture that first compresses local interactions into compact tokens and then integrates them to capture temporally extended dependencies. Together, these mechanisms scale short-term memory in visuomotor policies to up to two minutes at five frames per second, an order of magnitude longer than previous approaches. To systematically evaluate memory in visuomotor control, we introduce ReMemBench, a benchmark of eight diverse household manipulation tasks spanning four categories of short-term memory, designed to foster general memory mechanisms rather than siloed, task-specific solutions. PRISM consistently outperforms prior work, including transformer-based visuomotor policies with short-term memory, recurrent architectures, and other short-term memory-management strategies. On ReMemBench and in real-world evaluation, PRISM achieves an absolute improvement of 11–15 points over the strongest baseline. On RoboCasa and LIBERO, it further yields 11–14-point gains over its no-memory variant and outperforms strong fine-tuned VLA baselines such as GR00T-N1-3B and OpenVLA, without any pretraining. Together, PRISM and ReMemBench establish a foundation for developing and evaluating short-term memory-augmented visuomotor policies that scale to long-horizon tasks.
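The abstract names two mechanisms without detailing them. As a rough illustration only, the following sketch shows one plausible reading: local frame features are pooled into compact memory tokens (mean pooling stands in for whatever learned compression the paper uses), and retrieved memory is modulated by a sigmoid gate so irrelevant content can be driven toward zero. All shapes, the pooling scheme, and the gate parameterization are assumptions, not the paper's actual formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_windows(frames, window):
    """Hierarchical step (assumed): pool each local window of frame
    features into one compact memory token via mean pooling."""
    n, d = frames.shape
    usable = n - n % window
    return frames[:usable].reshape(-1, window, d).mean(axis=1)

def gated_attention(queries, memory, w_gate):
    """Attend over compact memory tokens, then gate the retrieved
    content with a sigmoid so irrelevant details are suppressed.
    w_gate is a hypothetical learned gate projection."""
    d = queries.shape[-1]
    scores = queries @ memory.T / np.sqrt(d)        # (T, M) similarity
    retrieved = softmax(scores, axis=-1) @ memory   # (T, d) memory readout
    gate = 1.0 / (1.0 + np.exp(-(queries @ w_gate)))  # (T, d), values in (0, 1)
    return gate * retrieved                          # filtered readout

rng = np.random.default_rng(0)
frames = rng.normal(size=(600, 8))           # two minutes at 5 FPS = 600 frames
memory = compress_windows(frames, window=10) # 60 compact tokens
out = gated_attention(rng.normal(size=(4, 8)), memory, rng.normal(size=(8, 8)))
print(memory.shape, out.shape)  # (60, 8) (4, 8)
```

Note how the compression step is what makes a 600-frame history tractable: the policy's attention operates over 60 compact tokens rather than every raw frame.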
Primary Area: applications to robotics, autonomy, planning
Submission Number: 23702