DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation

Published: 07 May 2025 · Last Modified: 07 May 2025 · ICRA Workshop Human-Centered Robot Learning · CC BY 4.0
Workshop Statement: The DynaMem system directly aligns with the workshop's theme of human-centered robot learning and large models. The system leverages large language models (LLMs) and large multimodal models (LMMs) to enable a mobile robot to follow human instructions such as "pick up object A and place it on receptacle B" in unseen, dynamic home environments. Specifically, by leveraging outputs from powerful pretrained models such as SigLIP and GPT, DynaMem builds a real-time, adaptive 3D scene representation that encodes semantic information from the environment and updates online as the environment changes. This memory is critical for human-robot collaboration: it allows the robot to localize any object from a human instruction in a changing environment, for example when objects are added or removed or people move around. Moreover, DynaMem deploys an open-vocabulary manipulation system built on pretrained foundation models such as AnyGrasp and OWLv2, allowing the robot to pick and place objects in ways specified by humans. The workshop's focus on leveraging large models and datasets to enhance human-robot interaction therefore resonates with DynaMem's design.
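As a concrete illustration of the localization step described above, here is a minimal sketch (not the paper's implementation) of how an open-vocabulary query could be answered against a voxelized memory: each voxel stores a vision-language feature (e.g., produced by a SigLIP image encoder), and a text query is resolved by cosine similarity against the query's text embedding. The names `localize_query`, `voxel_features`, and `voxel_centers` are illustrative placeholders, and random vectors stand in for real embeddings.

```python
import numpy as np

def localize_query(text_embedding: np.ndarray,
                   voxel_features: np.ndarray,
                   voxel_centers: np.ndarray) -> tuple[np.ndarray, float]:
    """Return the 3D center of the voxel whose stored vision-language
    feature is most similar to the query text embedding.

    text_embedding: (D,) text feature, e.g. from a SigLIP text encoder.
    voxel_features: (N, D) per-voxel image features accumulated in memory.
    voxel_centers:  (N, 3) xyz coordinates of the voxel centers.
    """
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = text_embedding / np.linalg.norm(text_embedding)
    f = voxel_features / np.linalg.norm(voxel_features, axis=1, keepdims=True)
    scores = f @ q                      # (N,) similarity per voxel
    best = int(np.argmax(scores))
    return voxel_centers[best], float(scores[best])

# Toy usage with random features standing in for real embeddings.
rng = np.random.default_rng(0)
centers = rng.uniform(-2.0, 2.0, size=(1000, 3))
features = rng.normal(size=(1000, 512))
query = rng.normal(size=512)
xyz, score = localize_query(query, features, centers)
print(f"best voxel at {xyz}, similarity {score:.3f}")
```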
Keywords: open vocabulary, mobile manipulation, foundation models, semantic memory, scene representation
TL;DR: We design an online-updatable spatio-semantic memory for open-world, open-vocabulary mobile manipulation that generalizes to unseen human environments.
Abstract: Significant progress has been made in open-vocabulary mobile manipulation, where the goal is for a robot to perform tasks in any environment given a natural language description. However, most current systems assume a static environment, which limits the system’s applicability in real-world scenarios where environments frequently change due to human intervention or the robot’s own actions. In this work, we present DynaMem, a new approach to open-world mobile manipulation that uses a dynamic spatio-semantic memory to represent a robot’s environment. DynaMem constructs a 3D data structure to maintain a dynamic memory of point clouds, and answers open-vocabulary object localization queries using multimodal LLMs or open-vocabulary features generated by state-of-the-art vision-language models. Powered by DynaMem, our robots can explore novel environments, search for objects not found in memory, and continuously update the memory as objects move, appear, or disappear in the scene. We run extensive experiments on the Stretch SE3 robots in three real and nine offline scenes, and achieve an average pick-and-drop success rate of 70% on non-stationary objects, a 3× improvement over state-of-the-art static systems.
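To make the dynamic-memory idea concrete, the following is a minimal sketch under our own simplifying assumptions (a hash map over quantized voxels with per-voxel timestamps; not the paper's code) of how points can be added as they are observed and expired when a newer observation indicates they are gone. The class name `DynamicVoxelMemory` and its methods are hypothetical.

```python
import numpy as np

class DynamicVoxelMemory:
    """Toy dynamic spatio-semantic memory: voxels are added when observed
    and deleted when a later observation indicates they are gone."""

    def __init__(self, voxel_size: float = 0.05):
        self.voxel_size = voxel_size
        # Maps a quantized voxel coordinate to its last-seen timestamp.
        self.voxels: dict[tuple[int, int, int], float] = {}

    def _key(self, p: np.ndarray) -> tuple[int, int, int]:
        # Quantize a 3D point to its containing voxel.
        return tuple(np.floor(p / self.voxel_size).astype(int))

    def add_points(self, points: np.ndarray, t: float) -> None:
        """Insert or refresh observed 3D points (N, 3) at time t."""
        for p in points:
            self.voxels[self._key(p)] = t

    def remove_stale(self, observed_empty: np.ndarray, t: float) -> None:
        """Delete voxels that the current frame observed as empty, but only
        if our record of them is older than this observation."""
        for p in observed_empty:
            k = self._key(p)
            if k in self.voxels and self.voxels[k] < t:
                del self.voxels[k]
```

In DynaMem proper, each voxel additionally carries semantic features, and the add/remove decisions are driven by posed depth images; the sketch above only captures the timestamped insert/expire pattern that lets the memory track objects as they move, appear, or disappear.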
Supplementary Material: zip
Submission Number: 8