Abstract: In this paper, we tackle the problem of RGB-D Semantic
Segmentation. The key challenges in solving this problem lie in 1) how
to extract features from depth sensor data and 2) how to effectively
fuse the features extracted from the two modalities. For the first challenge, we found that the depth information obtained from the sensor
is not always reliable (e.g. objects with reflective or dark surfaces typically have inaccurate or void sensor readings), and existing methods
that extract depth features using ConvNets do not explicitly consider
the reliability of the depth values at different pixel locations. To tackle this
challenge, we propose a novel mechanism, namely Uncertainty-Aware
Self-Attention, which explicitly controls the information flow from unreliable depth pixels to confident ones during feature extraction. For
the second challenge, we propose an effective and scalable fusion module
based on Cross-Attention that performs adaptive and asymmetric information exchange between the RGB and depth encoders. Our proposed
framework, namely UCTNet, is an encoder-decoder network that naturally incorporates these two key designs for robust and accurate RGB-D
Segmentation. Experimental results show that UCTNet outperforms existing works and achieves state-of-the-art performance on two RGB-D
Semantic Segmentation benchmarks.
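The following is a minimal, illustrative sketch (not the authors' implementation) of how an Uncertainty-Aware Self-Attention of the kind described above could suppress information flow from unreliable depth pixels: attention logits toward keys with low depth reliability are biased downward, so confident pixels aggregate features mainly from other confident pixels. The module name, the `reliability` input, and the log-bias formulation are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class UncertaintyAwareSelfAttention(nn.Module):
    """Illustrative self-attention over depth tokens in which keys at
    unreliable depth positions are down-weighted, limiting information
    flow from unreliable pixels to confident ones (sketch only)."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, reliability):
        # x:           (B, N, C) depth-feature tokens, N = H * W flattened pixels
        # reliability: (B, N) in [0, 1]; low values mark void or noisy depth readings
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, heads, N, N)
        # Additive log-bias: fully unreliable keys get ~-inf, confident keys get 0,
        # so softmax routes attention mass toward reliable depth pixels.
        bias = torch.log(reliability.clamp(min=1e-6))      # (B, N)
        attn = (attn + bias[:, None, None, :]).softmax(dim=-1)

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Example usage with random tensors (shapes only, no pretrained weights):
attn_layer = UncertaintyAwareSelfAttention(dim=64, num_heads=8)
tokens = torch.randn(2, 100, 64)        # 2 images, 10x10 pixels, 64 channels
reliability = torch.rand(2, 100)        # e.g. derived from a depth validity map
out = attn_layer(tokens, reliability)   # (2, 100, 64)
```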
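Likewise, a hedged sketch of a Cross-Attention fusion module in which each modality queries the other with its own set of attention weights, so the exchange between the RGB and depth encoders is adaptive and asymmetric. The class and argument names are illustrative assumptions; the paper's actual fusion design may differ.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Illustrative bidirectional cross-attention between RGB and depth tokens.
    Each direction has its own attention weights, so the exchange is asymmetric."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.rgb_from_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_feat, depth_feat):
        # rgb_feat, depth_feat: (B, N, C) token sequences from the two encoders.
        rgb_msg, _ = self.rgb_from_depth(rgb_feat, depth_feat, depth_feat)
        depth_msg, _ = self.depth_from_rgb(depth_feat, rgb_feat, rgb_feat)
        # Residual exchange: each stream keeps its own features and adds the
        # message gathered from the other modality.
        return (self.norm_rgb(rgb_feat + rgb_msg),
                self.norm_depth(depth_feat + depth_msg))


# Example usage with random features from both encoders:
fusion = CrossAttentionFusion(dim=64, num_heads=8)
rgb_tokens = torch.randn(2, 100, 64)
depth_tokens = torch.randn(2, 100, 64)
rgb_fused, depth_fused = fusion(rgb_tokens, depth_tokens)   # both (2, 100, 64)
```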