Object-Centric Agentic Robot Policies

Published: 23 Sept 2025, Last Modified: 19 Nov 2025SpaVLE PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: manipulation, mobile manipulation, vision-language models, scene understanding, robot learning
Abstract: Executing open-ended natural language queries in previously unseen environments is a core problem in robotics. While recent advances in imitation learning and vision-language modeling have enabled promising end-to-end policies, these models struggle when faced with complex instructions and new scenes. Their short input context also limits their ability to solve tasks over larger spatial horizons. In this work, we introduce OCARP, a modular agentic robot policy that executes user queries by using a library of tools on a dynamic inventory of objects. The agent builds the inventory by grounding query-relevant objects using a rich 3D map representation that includes open-vocabulary descriptors and 3D affordances. By combining the flexible reasoning abilities of an agent with a general spatial representation, OCARP can execute complex open-vocabulary queries in a zero-shot manner. We showcase how OCARP can be deployed in both tabletop and mobile settings due to the underlying scalable map representation.
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 45
Loading