Abstract: Robotic table-top grasping, also known as bin-picking, of unknown heterogeneous objects placed together is a challenging problem due to clutter and uncertainty in depth data. Among the many types of solutions to this problem, one applies a category-agnostic instance segmentation step before a grasp-planning step to segment out object boundaries. Recently, various such methods have been proposed that use depth data from industrial-grade sensors as input to trained deep-learning models to reliably segment object instances in a cluttered bin. However, the depth data obtained from commonly used commodity-grade depth sensors is generally noisy and, in particular, unreliable for non-opaque and thin objects. Another challenge is time efficiency when the grasp-planning step must process each of the segmented objects. To address these challenges, we propose a unified depth-independent CNN design that co-learns category-agnostic instance segmentation, instance-wise grasp-confidence scores (GCS), and monocular depth estimation, given the RGB image of the scene as input. The estimated depth is used to detect collisions during grasp pose prediction and to transform the predicted grasp pose to 3D world coordinates. A novel GCS branch is added to the instance-segmentation model to filter the detected object instances by their graspability, avoiding the need to process every object for grasp planning. A custom-generated synthetic dataset is leveraged to train the proposed CNN architecture. We show through experiments that our proposed bin-picking system can reliably pick various kinds of unknown objects (opaque, non-opaque, and thin) from a clutter of around 20-40 objects.
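To make the multi-task design described above concrete, the following is a minimal illustrative sketch (in PyTorch) of a shared RGB backbone feeding three heads: category-agnostic instance masks, per-instance grasp-confidence scores (GCS), and dense monocular depth. All module names, layer sizes, and output shapes here are assumptions for illustration, not the authors' actual architecture.

```python
# Illustrative sketch only: a shared backbone with three task heads, assumed
# shapes and layer sizes; not the architecture from the paper.
import torch
import torch.nn as nn

class BinPickingNet(nn.Module):
    def __init__(self, num_features=64, max_instances=40):
        super().__init__()
        # Shared convolutional backbone over the RGB image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, num_features, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(num_features, num_features, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Head 1: class-agnostic instance masks (one channel per candidate instance).
        self.mask_head = nn.Conv2d(num_features, max_instances, 1)
        # Head 2: per-instance grasp-confidence scores from pooled features.
        self.gcs_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(num_features, max_instances), nn.Sigmoid(),
        )
        # Head 3: dense monocular depth estimation.
        self.depth_head = nn.Conv2d(num_features, 1, 1)

    def forward(self, rgb):
        feats = self.backbone(rgb)
        return {
            "instance_masks": self.mask_head(feats),   # (B, K, H/4, W/4) logits
            "grasp_confidence": self.gcs_head(feats),  # (B, K) in [0, 1]
            "depth": self.depth_head(feats),           # (B, 1, H/4, W/4)
        }

if __name__ == "__main__":
    model = BinPickingNet()
    out = model(torch.randn(1, 3, 256, 256))
    # The GCS output lets a planner keep only instances above a graspability
    # threshold instead of processing every segmented object.
    graspable = out["grasp_confidence"][0] > 0.5
    print(out["instance_masks"].shape, out["depth"].shape, int(graspable.sum()))
```

The point of the sketch is the filtering step at the end: thresholding the GCS output reduces the set of instances passed to grasp planning, which is the time-efficiency benefit the abstract describes.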
DOI: 10.36227/techrxiv.175289326.63384035/v1