Abstract: Recent works on 3D object detection take the range image
as input and have achieved performance comparable
to bird’s eye view (BEV) based methods. Compared to
BEV, the range view provides dense and compact observations,
which allow the use of more widely used feature encoders. To leverage
the complementary information of the range view and BEV,
we present ACDet, a novel single-stage multi-view fusion
method. Rather than fusing point-level features from the range
view and BEV at an early stage, we introduce, as our key
contribution, an attentive cross-view fusion module based on
a transformer to fuse higher-level features, and we further adopt
a supervised foreground mask learned from BEV features to
enhance the fused features. Notably, a geometric-attention
kernel is proposed to enhance features extracted from the range
image. Finally, we design an anchor-free detection head
with an optimized label assignment strategy; its performance
exceeds that of existing anchor-based and anchor-free
3D detection heads by a large margin. We evaluate our
ACDet model extensively on the KITTI dataset and the Waymo
Open Dataset (WOD). ACDet outperforms most single-stage
models on the KITTI dataset in terms of multi-class 3D
and BEV mean average precision. ACDet also outperforms
both range-view and multi-view fusion methods on WOD.