RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Published: 27 May 2026, Last Modified: 04 Jun 2026, FMEA @ CVPR 2026 Oral, CC BY 4.0
Keywords: Robot Manipulation, Benchmark, Memory-Augmented Policy, Vision-Language-Action Models
TL;DR: A large-scale robotic manipulation benchmark designed for history-dependent tasks, paired with a comprehensive study of memory-augmented vision–language–action models.
Abstract: Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings, which hinders systematic understanding, comparison, and measurement of progress. To address these challenges, we introduce **RoboMME**: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates *temporal*, *spatial*, *object*, and *procedural* memory. We further develop a suite of 14 memory-augmented VLA variants built on the $\pi_{0.5}$ backbone to systematically explore different memory representations across multiple integration strategies. We show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at https://anonymtest1.github.io/. This paper has been accepted by other conferences.
Submission Number: 30