URFormer: Unified Representation LiDAR-Camera 3D Object Detection with Transformer

Published: 01 Jan 2023, Last Modified: 15 May 2025, PRCV (3) 2023, CC BY-SA 4.0
Abstract: Current LiDAR-camera 3D detectors adopt a 3D-2D design pattern. However, this paradigm ignores the dimensional gap between heterogeneous modalities (e.g., coordinate system, data distribution), making it difficult to marry the geometric and semantic information of the two modalities. Moreover, conventional 3D convolutional neural network (3D CNN) backbones have limited receptive fields, which discourages interaction between multi-modal features, especially when capturing long-range object context. To this end, we propose a Unified Representation Transformer-based multi-modal 3D detector (URFormer) with a better representation scheme and cross-modality interaction, which consists of three crucial components. First, we propose the Depth-Aware Lift Module (DALM), which exploits depth information in the 2D modality and lifts the 2D representation into 3D at the pixel level, naturally unifying the inconsistent multi-modal representations. Second, we design a Sparse Transformer (SPTR) to enlarge effective receptive fields and capture long-range object semantic features for better interaction between multi-modal features. Finally, we design Unified Representation Fusion (URFusion) to integrate cross-modality features in a fine-grained manner. Extensive experiments on the KITTI benchmark demonstrate the effectiveness of our method and show remarkable performance compared to state-of-the-art methods.
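The abstract describes DALM as lifting 2D image features into 3D at the pixel level using depth, so that camera and LiDAR features share one representation. The sketch below illustrates that general idea only; it is not the authors' code, and the function name `lift_features_to_3d`, the tensor layouts, and the use of camera intrinsics for back-projection are assumptions for illustration.

```python
# Minimal sketch (assumed, not the paper's implementation) of depth-aware
# pixel lifting: unproject every image feature to a 3D point using per-pixel
# depth and camera intrinsics, so camera features live in a 3D frame like LiDAR.
import torch

def lift_features_to_3d(feats, depth, intrinsics):
    """feats: [B, C, H, W] image features
       depth: [B, H, W] predicted per-pixel depth (metres)
       intrinsics: [B, 3, 3] camera matrices
       returns: points [B, H*W, 3] and per-point features [B, H*W, C]"""
    B, C, H, W = feats.shape
    # Pixel grid in homogeneous coordinates, shape [3, H*W]
    ys, xs = torch.meshgrid(torch.arange(H, dtype=feats.dtype),
                            torch.arange(W, dtype=feats.dtype), indexing="ij")
    pix = torch.stack([xs.reshape(-1), ys.reshape(-1),
                       torch.ones(H * W, dtype=feats.dtype)], dim=0)
    # Back-project: X_cam = depth * K^{-1} * [u, v, 1]^T
    rays = torch.linalg.inv(intrinsics) @ pix.unsqueeze(0)        # [B, 3, H*W]
    points = (rays * depth.reshape(B, 1, H * W)).transpose(1, 2)  # [B, H*W, 3]
    point_feats = feats.reshape(B, C, H * W).transpose(1, 2)      # [B, H*W, C]
    return points, point_feats

# Toy usage with random inputs and a hypothetical intrinsics matrix
feats = torch.randn(1, 64, 4, 6)
depth = torch.rand(1, 4, 6) * 50.0
K = torch.tensor([[[500.0, 0.0, 3.0], [0.0, 500.0, 2.0], [0.0, 0.0, 1.0]]])
pts, pf = lift_features_to_3d(feats, depth, K)
print(pts.shape, pf.shape)  # torch.Size([1, 24, 3]) torch.Size([1, 24, 64])
```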