A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

Published: 24 Oct 2024, Last Modified: 06 Nov 2024, LEAP 2024 Poster, License: CC BY 4.0
Keywords: Robotic Manipulation, Vision and Language Models, Real-to-Sim-to-Real
TL;DR: A framework that uses vision-language models to generate keypoint-based rewards for robotic manipulation tasks, enabling real-to-sim-to-real transfer and multi-step task execution with adaptive replanning.
Abstract: Task specification for robotic manipulation in open-world environments is challenging. Importantly, this process requires flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a framework that leverages vision-language models (VLMs) to generate and refine visually grounded reward functions serving as dynamic task specifications for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, IKER samples keypoints from the scene and uses VLMs to generate Python-based reward functions conditioned on these keypoints. These functions operate on the spatial relationships between keypoints, enabling precise SE(3) control and leveraging VLMs as proxies to encode human priors about robotic behaviors. We reconstruct real-world scenes in simulation and use the generated rewards to train RL policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach handles diverse scenarios, including both prehensile and non-prehensile tasks, and demonstrates multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustment. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping. Project Page: https://iker-robot.github.io/
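To make the abstract concrete, below is a minimal sketch of what a VLM-generated, keypoint-conditioned reward function could look like. It is not the paper's actual output: the keypoint names (object_tip, object_base, target_point), the distance and tilt terms, and the weighting are all hypothetical, chosen only to illustrate a reward defined over spatial relationships between keypoints.

import numpy as np

def reward(keypoints: dict[str, np.ndarray]) -> float:
    """Hypothetical IKER-style reward over 3D keypoint positions (meters).

    A VLM would emit a function of roughly this shape, conditioned on
    keypoints sampled from the RGB-D scene; all names are illustrative.
    """
    tip = keypoints["object_tip"]       # keypoint on the manipulated object
    base = keypoints["object_base"]     # second keypoint on the same object
    target = keypoints["target_point"]  # keypoint marking the goal location

    # Primary term: drive the object's tip toward the goal keypoint.
    distance = np.linalg.norm(tip - target)

    # Secondary term: keep the object level by matching the heights of its
    # two keypoints, encoding a human prior about careful placement.
    tilt = abs(tip[2] - base[2])

    return float(-distance - 0.5 * tilt)

Because such a reward is ordinary Python over keypoint positions, the VLM can regenerate or refine it between steps, which is what would enable the iterative reward shaping and on-the-fly replanning the abstract describes.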
Submission Number: 34