Abstract: Efficient grasp pose detection is essential for robotic manipulation in cluttered scenes. However, most existing methods rely on either point clouds or images alone for prediction, ignoring the complementary advantages of the two modalities. In this paper, we present a multi-modal fusion network for joint segmentation and grasp pose detection. We design a point cloud and image co-guided feature fusion module that fuses features and adaptively estimates the importance of point-pixel feature pairs. Moreover, we develop a seed point sampling algorithm that simultaneously considers distance, semantics, and attention scores. For the selected seed points, we adopt a local feature aggregation module to fully exploit the local spatial features in the grasp region. Experimental results on the GraspNet-1Billion dataset show that our network outperforms several state-of-the-art methods. We also conduct real robot grasping experiments to demonstrate the effectiveness of our approach.
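The abstract does not specify how the three sampling cues are combined; below is only a minimal, hypothetical sketch of a farthest-point-sampling variant whose selection score is biased by per-point semantic and attention scores. The function name `sample_seed_points`, the weights `w_sem` and `w_att`, and the multiplicative scoring rule are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def sample_seed_points(xyz, semantic_prob, attention, num_seeds,
                       w_sem=1.0, w_att=1.0):
    """Farthest-point-style seed sampling biased by semantics and attention.

    xyz:           (N, 3) point coordinates
    semantic_prob: (N,) probability that a point belongs to an object
    attention:     (N,) per-point attention score in [0, 1]
    Returns the indices of the selected seed points.
    """
    n = xyz.shape[0]
    selected = np.zeros(num_seeds, dtype=np.int64)
    min_dist = np.full(n, np.inf)

    # Start from the point with the highest combined semantic/attention score.
    score_bias = w_sem * semantic_prob + w_att * attention
    selected[0] = int(np.argmax(score_bias))

    for i in range(1, num_seeds):
        # Update each point's distance to its nearest already-selected seed.
        d = np.linalg.norm(xyz - xyz[selected[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        # Combine spatial coverage (distance) with semantics and attention.
        score = min_dist * (1.0 + score_bias)
        selected[i] = int(np.argmax(score))
    return selected


# Toy usage with random data.
pts = np.random.rand(1024, 3)
sem = np.random.rand(1024)
att = np.random.rand(1024)
seeds = sample_seed_points(pts, sem, att, num_seeds=64)
```

In this sketch, the distance term preserves spatial coverage as in standard farthest point sampling, while the semantic and attention terms push seeds toward graspable object regions; the actual weighting used by the paper may differ.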