Abstract: With the rapid development of cameras and deep learning technologies, computer vision tasks such as object detection, object segmentation, and object tracking are being widely applied in many fields. For robot grasping tasks, object segmentation aims to classify and localize objects, helping robots pick objects accurately. The state-of-the-art instance segmentation framework, Mask Region-based Convolutional Neural Network (Mask R-CNN), does not always segment accurately at the edges or borders of objects. Approaches using a 3D camera, in contrast, can extract entire (foreground) objects easily but find it difficult, or computationally expensive, to classify them. We propose a novel approach in which we
combine Mask R-CNN with 3D algorithms by adding a 3D process branch for instance segmentation.
The outputs of the two branches are used simultaneously to classify the pixels at object edges by exploiting the spatial relationship between the edge region and the mask region. We evaluate the effectiveness of the method on challenging object configurations, for example, objects that are close together, overlapping, or occluded by one another, in order to focus on edge and border segmentation. Our proposed method achieves an IoU (intersection over union) that is about 4 to 7% higher and more stable, which leads to an mAP (mean average precision) of 46%, a higher accuracy than its counterpart. The feasibility experiment shows that our method could be a significant step forward for research on grasping robots.
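For reference, the IoU metric quoted above compares a predicted mask against a ground-truth mask as the ratio of their intersection to their union. A minimal sketch, assuming binary masks stored as NumPy boolean arrays (the function name and example masks are illustrative, not from the paper):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union else 0.0

# Example: two overlapping 4x4 masks.
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :] = True   # top two rows predicted as object
gt = np.zeros((4, 4), dtype=bool)
gt[1:3, :] = True    # middle two rows are the ground truth

print(mask_iou(pred, gt))  # intersection 4 px, union 12 px -> 0.333...
```

A per-instance IoU like this is the quantity averaged (over instances and thresholds) when reporting mAP for instance segmentation.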