MetaGraspNetV2: All-in-One Dataset Enabling Fast and Reliable Robotic Bin Picking via Object Relationship Reasoning and Dexterous Grasping
Abstract: Grasping unknown objects in unstructured environments is one of the most challenging and demanding tasks for robotic bin picking systems. A holistic approach is crucial for building dexterous bin picking systems that meet practical requirements on speed, cost, and reliability. Datasets proposed so far focus only on isolated sub-problems and are therefore limited in their ability to leverage the complementary relationships between individual tasks. In this paper, we tackle this holistic data challenge and design MetaGraspNetV2, an all-in-one bin picking dataset consisting of (i) a photo-realistic dataset with over 296k images created through physics-based metaverse synthesis, and (ii) a real-world test dataset with 3.2k images featuring task-specific difficulty levels. Both datasets provide full annotations for amodal panoptic segmentation, object relationship detection, occlusion reasoning, 6-DoF pose estimation, and grasp detection for a parallel-jaw as well as a vacuum gripper. Extensive experiments demonstrate that training on our dataset outperforms training on state-of-the-art datasets for object detection, instance segmentation, amodal detection, parallel-jaw grasping, and vacuum grasping. Furthermore, leveraging the potential of our data for building holistic perception systems, we propose a single-shot-multi-pick (SSMP) grasping policy that uses scene understanding to accelerate picking in high clutter. SSMP reasons about suitable manipulation orders for blindly picking multiple items from a single image acquisition. Physical robot experiments demonstrate that SSMP shortens cycle times by reducing image acquisitions by more than 47% while providing better grasp performance than state-of-the-art bin picking methods.

Note to Practitioners: In robotic bin picking, most proposed methods and datasets focus on solving only one aspect of the grasping task, such as grasp point detection, object detection, or relationship reasoning. They do not address practical aspects such as the widespread use of vacuum grasping technology or the need for short cycle times. In practice, however, efficient bin picking solutions often rely on multiple task-specific methods. Hence, having one dataset for a large variety of vision-related tasks in robotic picking reduces data redundancy and enables the development of holistic methods. While deep learning has proven highly effective for bin picking vision systems, it demands large, high-quality training datasets. Collecting such datasets in the real world, while ensuring label quality and consistency, is prohibitively expensive and time-consuming. To overcome these challenges, we set up a photo-realistic metaverse data generation pipeline and create a large-scale synthetic training dataset. Furthermore, we design a comprehensive real-world dataset for testing. Unlike previously proposed datasets, ours provide difficulty levels and annotations in both simulation and the real world for a comprehensive list of high-level tasks, including amodal object detection, scene layout reasoning, and grasp detection. In real-world applications, cycle time is a critical factor affecting the productivity and profitability of a robotic system. We address time efficiency through scene understanding and demonstrate the capability of our data for holistic system development by proposing a single-shot-multi-pick (SSMP) policy.
Our SSMP algorithm, trained exclusively on our synthetic data, distinguishes between uncovered and occluded items and infers specific manipulation orders to perform multiple blind picks in a single shot. Physical robot experiments show that SSMP reduces image acquisitions by more than 47% without compromising grasp performance. This clearly demonstrates that SSMP, together with our dataset, paves the way for application-oriented research in time-critical bin picking.
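To make the single-shot-multi-pick idea concrete, the minimal Python sketch below shows one simplified way such an ordering could be derived from a single image: items are scheduled for blind picking so that every item is grasped only after its occluders, highest grasp confidence first. The ItemCandidate structure, the min_score threshold, and the chaining of previously occluded items are illustrative assumptions for this sketch only; they do not reflect the actual SSMP implementation or the dataset's API.

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class ItemCandidate:
    """Hypothetical per-item result from a single image acquisition."""
    item_id: int
    grasp_score: float                                   # confidence of the best grasp on this item
    occluded_by: Set[int] = field(default_factory=set)   # ids of items resting on top of it

def single_shot_multi_pick_order(items: List[ItemCandidate],
                                 min_score: float = 0.5) -> List[int]:
    """Order items for blind picking from one image: an item is scheduled
    only once all of its occluders are scheduled earlier, and among the
    currently free items the highest grasp confidence goes first."""
    scheduled: List[int] = []
    picked: Set[int] = set()
    remaining = {it.item_id: it for it in items if it.grasp_score >= min_score}

    while True:
        # Items whose occluders have all been (virtually) removed already.
        free = [it for it in remaining.values() if it.occluded_by <= picked]
        if not free:
            break  # everything left is still covered -> acquire a new image
        best = max(free, key=lambda it: it.grasp_score)
        scheduled.append(best.item_id)
        picked.add(best.item_id)
        del remaining[best.item_id]
    return scheduled

# Example: item 2 lies on top of item 0; items 1 and 2 are uncovered.
candidates = [
    ItemCandidate(0, grasp_score=0.9, occluded_by={2}),
    ItemCandidate(1, grasp_score=0.7),
    ItemCandidate(2, grasp_score=0.8),
]
print(single_shot_multi_pick_order(candidates))  # -> [2, 0, 1]
```

A real policy would additionally decide when the scene state can no longer be trusted, for example because removing an occluder may have disturbed the items beneath it, and would then trigger a new image acquisition instead of continuing to pick blindly.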