Abstract: The state-of-the-art group-free network (GFNet) has achieved superior performance in 3D object detection for indoor scenes. However, we find that there is still room for improvement in three aspects. First, the seed point features extracted by the multi-layer perceptron (MLP) in the backbone (PointNet++) do not account for the differing importance of the features at each level. Second, the single-scale transformer module that GFNet uses to handle hand-crafted grouping via Hough voting cannot adequately model the relationship between points and objects. Finally, GFNet uses the decoders directly to predict detection results, disregarding the different contributions of the decoders at each stage. In this paper, we propose the group-free enhancement network (GFENet) to tackle these issues. Specifically, our network mainly consists of three enhancement modules: the weighted MLP (WMLP) module, the hierarchical-aware module, and the stage-aware module. The WMLP module adaptively combines features from different levels of the backbone before max-pooling to learn more informative features. The hierarchical-aware module models the relationship between points and objects hierarchically to mitigate the negative impact of insufficient modeling. The stage-aware module adaptively aggregates multi-stage predictions for better detection performance. Extensive experiments on the ScanNet V2 and SUN RGB-D datasets demonstrate the effectiveness and advantages of our method over existing 3D object detection methods.
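The idea behind the WMLP module, combining features from several MLP levels with adaptive weights before max-pooling, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the weights are computed, so the softmax over one scalar score per level, the function names (`weighted_level_fusion`, `max_pool`), and the assumption that all levels share the same channel width are hypothetical choices made only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_level_fusion(level_feats, level_scores):
    """Fuse per-level point features with softmax-normalized weights.

    level_feats[l][p][c] is channel c of point p at MLP level l.
    ASSUMPTION: every level has the same channel width; a real backbone
    would need a projection to a common width first.
    level_scores holds one scalar per level (hypothetical parameterization;
    in a trained network these would be learned).
    """
    weights = softmax(level_scores)
    n_points = len(level_feats[0])
    n_channels = len(level_feats[0][0])
    return [
        [
            sum(w * level_feats[l][p][c] for l, w in enumerate(weights))
            for c in range(n_channels)
        ]
        for p in range(n_points)
    ]


def max_pool(point_feats):
    """Channel-wise max over the points of one local neighborhood."""
    return [max(channel) for channel in zip(*point_feats)]


# Toy input: 2 levels, 2 points, 2 channels; equal scores give equal weights.
level_feats = [
    [[1.0, 2.0], [3.0, 0.0]],  # level 0
    [[3.0, 0.0], [1.0, 4.0]],  # level 1
]
fused = weighted_level_fusion(level_feats, level_scores=[0.0, 0.0])
pooled = max_pool(fused)
```

With equal scores the fusion reduces to a plain average of the levels; raising one level's score shifts the pooled feature toward that level, which is the adaptive behavior the abstract describes.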