Abstract: 3D object detection is a crucial research topic in computer vision, and conventional setups usually take
3D point clouds as input. Recently, there has been a trend toward leveraging multiple sources of input data, such as complementing
the 3D point cloud with 2D images, which often have richer
color and less noise. However, the heterogeneous
geometry of the 2D and 3D representations prevents
us from applying off-the-shelf neural networks to achieve
multimodal fusion. To that end, we propose Bridged Transformer (BrT), an end-to-end architecture for 3D object detection. BrT is simple and effective: it learns to identify
3D and 2D object bounding boxes from both points and image patches. A key element of BrT is its use
of object queries to bridge the 3D and 2D spaces, which
unifies the different data representations within a single Transformer. We also adopt a form of feature aggregation realized by
point-to-patch projections, which further strengthens the interaction between images and points. Moreover, BrT works
seamlessly when fusing the point cloud with multi-view images. We experimentally show that BrT surpasses state-of-the-art methods on the SUN RGB-D and ScanNetV2 datasets.
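
To make the idea of a point-to-patch projection concrete, the following is a minimal sketch and not the BrT implementation. It assumes a pinhole camera with a known intrinsic matrix, points already expressed in the camera frame, and an image whose width and height are multiples of the patch size. The function names `point_to_patch_indices` and `gather_patch_features` are hypothetical and used for illustration only.

```python
import numpy as np


def point_to_patch_indices(points_cam, K, image_hw, patch_size):
    """Map 3D points (N x 3, camera frame) to flat image-patch indices.

    Assumption: pinhole projection with intrinsics K (3 x 3).
    Returns (patch_idx, valid); patch_idx is -1 where the point
    does not land inside the image.
    """
    h, w = image_hw
    grid_h, grid_w = h // patch_size, w // patch_size

    z = points_cam[:, 2]
    in_front = z > 0
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth

    uvw = points_cam @ K.T               # homogeneous pixel coordinates
    u = uvw[:, 0] / safe_z
    v = uvw[:, 1] / safe_z

    valid = (
        in_front
        & (u >= 0) & (u < grid_w * patch_size)
        & (v >= 0) & (v < grid_h * patch_size)
    )

    col = np.floor(u / patch_size).astype(np.int64)
    row = np.floor(v / patch_size).astype(np.int64)
    patch_idx = np.where(valid, row * grid_w + col, -1)
    return patch_idx, valid


def gather_patch_features(patch_feats, patch_idx):
    """Fetch the patch feature each point projects to (zeros if none)."""
    valid = patch_idx >= 0
    gathered = patch_feats[np.clip(patch_idx, 0, None)]
    return gathered * valid[:, None]


# Toy example: a 96 x 128 image split into 16 x 16 patches (6 x 8 grid).
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 1.0],     # projects to the image centre
                   [0.0, 0.0, -1.0],    # behind the camera
                   [10.0, 0.0, 1.0],    # outside the image
                   [-0.5, -0.4, 1.0]])  # lands in the top-left patch
patch_idx, valid = point_to_patch_indices(points, K, (96, 128), 16)

patch_feats = np.arange(48 * 2, dtype=np.float64).reshape(48, 2)
point_image_feats = gather_patch_features(patch_feats, patch_idx)
```

In a fusion model, the gathered patch features could then be combined with the per-point features (for example by addition or concatenation); which combination BrT uses is not stated in the abstract.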