Scaffolding Dexterous Manipulation with Vision Language Models

Published: 25 Jun 2025 · Last Modified: 25 Jun 2025 · Dex-RSS-25 · CC BY 4.0
Keywords: Dexterous manipulation, residual RL, VLM
TL;DR: Uses VLMs to output reference trajectories for RL-based tracking policies.
Abstract: Dexterous robotic hands are essential for performing complex manipulation tasks, yet they remain difficult to train due to the challenges of demonstration collection and high-dimensional control. Contemporary works in dexterous manipulation have therefore often bootstrapped from reference trajectories: trajectories that specify target hand poses to guide the exploration of RL policies, and object poses that enable dense, task-agnostic rewards. However, sourcing suitable trajectories, particularly for dexterous hands, remains a significant challenge. Our key insight is that modern vision-language models (VLMs) already encode the commonsense spatial and semantic knowledge needed to specify tasks and guide exploration effectively. Given a task description (e.g., “open the cabinet”) and a visual scene, our method uses an off-the-shelf VLM to first identify task-relevant keypoints (e.g., handles, buttons) and then synthesize 3D trajectories for hand and object motion. Subsequently, we train a low-level residual RL policy in simulation to track these coarse trajectories, or “scaffolds”, with high fidelity.
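To make the tracking stage concrete, below is a minimal sketch assuming a scaffold of per-step reference hand and object poses has already been synthesized by the VLM. All names (`residual_action`, `dense_tracking_reward`), the pose dimensions, and the reward weights are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch: tracking a coarse VLM-generated "scaffold" with a
# residual action. Names and dimensions are illustrative assumptions.

def dense_tracking_reward(hand_pose, obj_pose, ref_hand, ref_obj,
                          w_hand=1.0, w_obj=1.0):
    """Task-agnostic dense reward: negative distance to the scaffold's
    reference hand and object poses at the current step."""
    return -(w_hand * np.linalg.norm(hand_pose - ref_hand)
             + w_obj * np.linalg.norm(obj_pose - ref_obj))

def residual_action(ref_hand, correction, scale=0.05):
    """Low-level action = coarse reference pose + small bounded residual
    produced by the RL policy."""
    return ref_hand + scale * np.tanh(correction)

# Toy usage: a 3-waypoint scaffold in a 6-DoF hand-pose space and a
# 7-DoF object pose (position + quaternion), both assumed to come from
# the VLM's keypoint-conditioned trajectory synthesis.
scaffold_hand = np.linspace(0.0, 1.0, 3)[:, None] * np.ones((3, 6))
scaffold_obj = np.tile(np.array([0, 0, 0, 1, 0, 0, 0], dtype=float), (3, 1))

correction = np.random.randn(6)            # stand-in for policy output
action = residual_action(scaffold_hand[0], correction)
r = dense_tracking_reward(action, scaffold_obj[0],
                          scaffold_hand[0], scaffold_obj[0])
print("action:", action, "reward:", r)
```

The residual parameterization keeps exploration near the scaffold: the policy only learns small corrections on top of the coarse reference, while the dense reward supplies a learning signal at every step without task-specific shaping.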
Submission Number: 18