GI-Grasp: Target-Oriented 6DoF Grasping Strategy with Grasp Intuition Based on Vision-Language Models
Abstract: Robot grasping is widely recognized as a crucial capability in robotics. Several deep-learning-based algorithms for planar and 6-degree-of-freedom (6DoF) grasping have been proposed and have produced good results in both simulation and the real world. However, when these algorithms estimate grasp poses, the predicted poses are not always placed at a sensible part of the object, even when they cover the target item. Such algorithms tend to treat the object as a whole and therefore behave quite differently from humans. To address this, we propose GI-Grasp, a novel strategy that lets the robot perceive the target object at a finer scale by introducing vision-language models (VLMs) to determine which part of the object is more suitable for grasping, guiding the robot to act more like a human. First, we perform instance segmentation on the RGB image of the grasping scene to detect and localize the objects to be grasped. Second, we provide the robot with prior knowledge of these objects through VLMs, helping it understand their compositional details and identify the spatial constraints relevant to the grasping task. Finally, the predicted suitable grasping region is combined with the grasping algorithm to improve the robot's grasping accuracy. Our real-world experiments show that GI-Grasp's awareness of object parts helps robots grasp items in a more human-like (and reasonable) manner, increasing the grasp success rate.
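The abstract outlines a three-step pipeline (instance segmentation, VLM-based part selection, integration with a grasp detector). The sketch below is only an illustration of that idea, not the authors' implementation: all function names (`query_vlm_for_grasp_part`, `pixel_to_point`, `rerank_by_part`) and the Gaussian reweighting scheme are assumptions, showing one plausible way a VLM-nominated part could bias 6DoF grasp candidates.

```python
# Hypothetical sketch of a GI-Grasp-style pipeline:
# 1) an instance mask / crop of the target object is assumed to be available,
# 2) a VLM nominates a graspable part (e.g., a mug handle) as a 2D pixel,
# 3) 6DoF grasp candidates from a base detector are reranked by proximity
#    to that part. All interfaces here are placeholders, not the paper's code.

from dataclasses import dataclass
import numpy as np


@dataclass
class GraspCandidate:
    position: np.ndarray   # (3,) grasp center in the camera frame, meters
    rotation: np.ndarray   # (3, 3) gripper orientation
    score: float           # quality score from the base grasp detector


def query_vlm_for_grasp_part(rgb_crop: np.ndarray, label: str) -> np.ndarray:
    """Placeholder for the VLM query: given an object crop and its label,
    return a pixel (u, v) inside the part the VLM deems graspable.
    A real system would call a vision-language model here."""
    h, w = rgb_crop.shape[:2]
    return np.array([w // 2, h // 2])  # dummy answer: image center


def pixel_to_point(uv: np.ndarray, depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a pixel to a 3D point using the depth map and intrinsics K."""
    u, v = int(uv[0]), int(uv[1])
    z = float(depth[v, u])
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    return np.array([x, y, z])


def rerank_by_part(candidates: list[GraspCandidate],
                   part_point: np.ndarray,
                   sigma: float = 0.05) -> list[GraspCandidate]:
    """Downweight grasp candidates far from the VLM-selected part.
    sigma (meters) controls how sharply distance penalizes the score."""
    def adjusted(c: GraspCandidate) -> float:
        d = np.linalg.norm(c.position - part_point)
        return c.score * float(np.exp(-(d ** 2) / (2 * sigma ** 2)))
    return sorted(candidates, key=adjusted, reverse=True)
```

In this sketch the base 6DoF grasp detector still proposes candidates over the whole object; the VLM-derived part point only reweights them so that poses landing on the nominated part are preferred. This mirrors the abstract's claim that part-level prior knowledge guides the final grasp pose rather than replacing the underlying grasping algorithm.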