ThinkGrasp: A Vision-Language System for Strategic Part Grasping in Clutter

Published: 05 Sept 2024 · Last Modified: 08 Nov 2024 · CoRL 2024 · CC BY 4.0
Keywords: Robotic Grasping, Vision-Language Models, Language Conditioned Grasping
TL;DR: ThinkGrasp is a plug-and-play vision-language grasping system that plans grasping strategies in heavily cluttered environments.
Abstract: Robotic grasping in cluttered environments remains a significant challenge due to occlusions and complex object arrangements. We have developed ThinkGrasp, a plug-and-play vision-language grasping system that leverages GPT-4o's advanced contextual reasoning to plan grasping strategies. ThinkGrasp can identify and generate grasp poses for target objects even when they are heavily occluded or nearly invisible, using goal-oriented language to guide the removal of obstructing objects. This approach progressively uncovers the target object and ultimately grasps it in a few steps with a high success rate. In both simulated and real-world experiments, ThinkGrasp achieved high success rates and significantly outperformed state-of-the-art methods in heavily cluttered environments and with diverse unseen objects, demonstrating strong generalization.
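The iterative strategy the abstract describes, repeatedly removing the most obstructing object until the target is exposed, can be sketched as follows. This is a minimal illustration only: the scene model and the function name are hypothetical stand-ins, not the ThinkGrasp API, and the VLM's reasoning is reduced to "pick the topmost occluder."

```python
def think_grasp(occluders, goal, max_steps=10):
    """Illustrative uncover-then-grasp loop (hypothetical, not the real API).

    `occluders` is a list of objects stacked on top of `goal` (topmost last).
    Each iteration performs one grasp: remove the topmost occluder, or, once
    the goal is exposed, grasp the goal itself. Returns the number of grasp
    actions executed, or None if the step budget is exhausted.
    """
    scene = list(occluders)
    for step in range(1, max_steps + 1):
        if not scene:        # goal is exposed: final grasp succeeds
            return step
        scene.pop()          # stand-in for VLM choosing an occluder to clear
    return None

# Goal buried under two objects -> two clearing grasps plus the final grasp.
print(think_grasp(["mug", "box"], "screwdriver"))  # → 3
```

The point of the loop is that each grasp is chosen with the final goal in mind, so heavily occluded targets are reached in few steps rather than by exhaustively clearing the scene.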
Supplementary Material: zip
Spotlight Video: mp4
Video: https://youtu.be/o5QHFhI95Qo
Website: https://h-freax.github.io/thinkgrasp_page/
Code: https://github.com/H-Freax/ThinkGrasp
Publication Agreement: pdf
Student Paper: yes
Submission Number: 472