Viewpoint-Agnostic Grasp Pipeline using VLM and Partial Observations

Published: 21 May 2026, Last Modified: 21 May 2026, Venue: ICRA 2026, License: CC BY 4.0
Keywords: robotics, grasping, manipulation, vision, machine learning, clutter, control, perception, locomanipulation
TL;DR: This paper presents an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot.
Abstract: Robust grasping in cluttered, unstructured environments remains challenging for mobile legged manipulators due to occlusions, unreliable depth, and the need for collision-free, execution-feasible approaches. We present an end-to-end pipeline for language-guided grasping that bridges open-vocabulary target selection to safe grasp execution on a real robot. Given a natural-language command, the system grounds the target in RGB using open-vocabulary detection and promptable segmentation, extracts an object-centric point cloud from RGB-D, and improves geometric reliability under occlusion via back-projected depth compensation and two-stage point cloud completion. We then generate and collision-filter 6-DoF grasp candidates and select an executable grasp using safety-oriented heuristics that account for reachability, approach feasibility, and clearance. We evaluate the method on a Boston Dynamics Spot with an arm in two cluttered tabletop scenarios, using paired trials against a view-dependent baseline. The proposed approach achieves a 90% success rate (9/10) versus 30% (3/10) for the baseline, demonstrating substantially improved robustness to occlusions and partial observations in clutter.
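As a rough illustration of the step that extracts an object-centric point cloud from RGB-D, the sketch below back-projects the depth pixels inside a segmentation mask through a standard pinhole camera model. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mask_to_point_cloud`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the synthetic inputs are all hypothetical, and the depth compensation and two-stage completion described in the abstract are not modeled here.

```python
import numpy as np

def mask_to_point_cloud(depth_m, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels into camera-frame 3D points (N x 3).

    Hypothetical helper: depth_m is an H x W depth image in meters, mask is an
    H x W boolean object mask, and fx, fy, cx, cy are pinhole intrinsics.
    """
    # Keep only pixels inside the mask that have a finite, positive depth.
    valid = mask.astype(bool) & np.isfinite(depth_m) & (depth_m > 0)
    v, u = np.nonzero(valid)  # v = row index, u = column index
    z = depth_m[v, u]
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)

# Tiny synthetic example: a 4 x 4 depth image with a 2 x 2 object mask.
depth = np.full((4, 4), 2.0)
depth[1, 1] = 0.0  # simulated depth dropout inside the mask
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True

points = mask_to_point_cloud(depth, mask, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(points.shape)  # (3, 3): the dropout pixel is excluded
```

Pixels with missing depth are simply dropped in this sketch, which is exactly the kind of gap that the pipeline's compensation and completion stages are described as addressing.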
Submission Number: 41