Keywords: 3D scene understanding, self-attention, transformer
Abstract: The recent success of neural networks has enabled better interpretation of 3D point clouds, but processing a large-scale 3D scene remains a challenging problem. Most approaches divide a large-scale scene into multiple regions and combine the local predictions, but this inevitably increases inference time and requires preprocessing stages such as k-nearest-neighbor search. An alternative is to quantize the point cloud into voxels and process them with sparse convolution. Although sparse convolution is efficient and scalable to large 3D scenes, its quantization artifacts impair geometric details and degrade prediction accuracy. This paper proposes the Efficient Point Transformer (EPT), which effectively alleviates quantization artifacts while avoiding excessive resource requirements. Each EPT layer implements a local self-attention mechanism that operates on continuous 3D coordinates and offers fast inference through a voxel hashing-based architecture. The proposed method can be adopted for various 3D vision applications, such as 3D semantic segmentation and 3D detection. In experiments, the proposed EPT model outperforms the state of the art on large-scale 3D semantic segmentation benchmarks and also surpasses point-based and voxel-based baselines on 3D detection benchmarks.
One-sentence Summary: This paper proposes an Efficient Point Transformer (EPT) that effectively alleviates quantization artifacts while avoiding excessive resource requirements.
Supplementary Material: zip
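
The abstract's core idea, local self-attention over continuous 3D coordinates with points organized by a voxel hash, can be pictured with the minimal sketch below. This is an illustrative sketch assuming PyTorch, not the authors' implementation: the class name `LocalPointSelfAttention`, the vector-attention formulation, and the per-voxel grouping loop are assumptions made for exposition only.

```python
# Minimal sketch (NOT the authors' code): local self-attention over points
# grouped by a voxel hash, assuming PyTorch. Points falling in the same voxel
# form a local neighborhood; attention within each neighborhood uses the
# continuous coordinates for relative positional encoding, so geometric detail
# is not lost to quantization.
import torch
import torch.nn as nn


class LocalPointSelfAttention(nn.Module):
    def __init__(self, dim: int, voxel_size: float = 0.1):
        super().__init__()
        self.voxel_size = voxel_size
        self.to_qkv = nn.Linear(dim, dim * 3)
        # MLP that maps relative 3D offsets to a positional encoding.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # coords: (N, 3) continuous coordinates, feats: (N, C) point features.
        # Assign each point to a voxel; points sharing a voxel attend to each other.
        voxel = torch.floor(coords / self.voxel_size).long()
        _, group = torch.unique(voxel, dim=0, return_inverse=True)
        q, k, v = self.to_qkv(feats).chunk(3, dim=-1)
        out = torch.zeros_like(feats)
        # The explicit Python loop is for clarity only; an efficient version would
        # batch neighborhoods via hash-based indexing.
        for g in group.unique():
            idx = (group == g).nonzero(as_tuple=True)[0]
            # Relative positions within the voxel keep continuous geometry intact.
            rel = coords[idx][:, None, :] - coords[idx][None, :, :]   # (m, m, 3)
            pos = self.pos_mlp(rel)                                   # (m, m, C)
            attn = (q[idx][:, None, :] * (k[idx][None, :, :] + pos)).sum(-1)
            attn = attn.softmax(dim=-1)                               # (m, m)
            out[idx] = attn @ v[idx] + (attn[..., None] * pos).sum(1)
        return self.proj(out)
```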