Keywords: Robot Manipulation, Benchmark, Memory-Augmented Policy, Vision-Language-Action Models
TL;DR: A large-scale robotic manipulation benchmark designed for history-dependent tasks, paired with a comprehensive study of memory-augmented vision-language-action models.
Abstract: Memory is critical for long-horizon and history-dependent robotic
manipulation. Such tasks often involve counting repeated actions or
manipulating objects that become temporarily occluded. Recent
vision-language-action (VLA) models have begun to incorporate memory
mechanisms; however, their evaluations remain confined to narrow,
non-standardized settings. This hinders systematic understanding,
comparison, and measurement of progress. To address these challenges, we
introduce **RoboMME**: a large-scale standardized benchmark for
evaluating and advancing VLA models in long-horizon, history-dependent
scenarios. Our benchmark comprises 16 manipulation tasks constructed
under a carefully designed taxonomy that evaluates *temporal*, *spatial*,
*object*, and *procedural* memory. We further develop a suite of 14
memory-augmented VLA variants built on the $\pi_{0.5}$ backbone to
systematically explore different memory representations across multiple
integration strategies. We show that the effectiveness of a memory
representation is highly task-dependent: each design offers distinct
advantages and limitations. Videos and
code can be found at https://anonymtest1.github.io/. This paper has been accepted by other conferences.
Submission Number: 30