Free-form language-based robotic reasoning and grasping

Published: 08 Oct 2025 · Last Modified: 08 Oct 2025 · HEAI 25 Poster · CC BY 4.0
Keywords: spatial reasoning, robotic grasping, vision language models
TL;DR: We use advanced VLMs and visual prompting to reason about human instructions and object occlusions for zero-shot robotic grasping in clutter.
Abstract: Performing robotic grasping from cluttered bins based on human instructions requires understanding both free-form language and spatial object relationships. We propose FreeGrasp, a novel method that leverages pre-trained vision–language models (VLMs) for zero-shot reasoning over human instructions and object arrangements. Our approach represents objects as keypoints, enabling the VLM to infer grasp sequences and decide whether to grasp directly or remove occluding objects first. We further construct a synthetic dataset with annotated instructions and grasp sequences, and validate our method in both simulated and real-world settings with a robotic arm. Experiments demonstrate state-of-the-art performance in grasp reasoning and execution, highlighting the potential of VLMs for instruction-based reasoning and grasping. Project website: https://tev-fbk.github.io/FreeGrasp/.
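To make the keypoint-based visual prompting idea concrete, here is a minimal sketch of one plausible pipeline step. Everything in it is an illustrative assumption rather than the authors' implementation: the `mark_keypoints` and `query_vlm` names, the JSON reply schema, and the prompt wording are all hypothetical; `query_vlm` stands in for any multimodal chat API that takes an image and returns text.

```python
# Hypothetical sketch of keypoint-based visual prompting for grasp reasoning.
# Assumed names: DetectedObject, mark_keypoints, query_vlm, and the JSON schema.
import json
from dataclasses import dataclass

import cv2
import numpy as np


@dataclass
class DetectedObject:
    obj_id: int
    center: tuple  # (x, y) pixel coordinates of the object's keypoint


def mark_keypoints(image: np.ndarray, objects: list) -> np.ndarray:
    """Overlay numbered keypoint markers so the VLM can refer to objects by ID."""
    marked = image.copy()
    for obj in objects:
        x, y = obj.center
        cv2.circle(marked, (x, y), 12, (0, 255, 0), -1)
        cv2.putText(marked, str(obj.obj_id), (x - 6, y + 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 2)
    return marked


PROMPT_TEMPLATE = (
    "The image shows a cluttered bin. Each object is marked with a numbered "
    "keypoint. Instruction: '{instruction}'. Reply with JSON: "
    '{{"target_id": <int>, "action": "grasp" | "remove_occluder", '
    '"occluder_id": <int or null>}}'
)


def plan_next_grasp(image, objects, instruction, query_vlm):
    """Ask the VLM whether the target is graspable now or blocked by an occluder.

    `query_vlm(image, prompt) -> str` is a placeholder for any multimodal
    model endpoint that accepts an image plus a text prompt.
    """
    marked = mark_keypoints(image, objects)
    reply = query_vlm(marked, PROMPT_TEMPLATE.format(instruction=instruction))
    plan = json.loads(reply)
    if plan["action"] == "remove_occluder":
        return plan["occluder_id"]  # clear this object first, then re-plan
    return plan["target_id"]        # target is directly graspable
```

Calling `plan_next_grasp` in a loop, removing the returned object each iteration, would yield the kind of grasp sequence the abstract describes: the VLM either selects the instructed target directly or first designates an occluding object for removal.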
Submission Number: 4