Abstract: Recent advances in query-based multi-camera 3D object detection are characterized by initializing object queries in 3D space and then sampling features from perspective-view images over multiple rounds of query refinement. In such a framework, query points near the same camera ray are likely to sample similar features from very close pixels, resulting in ambiguous query features and degraded detection accuracy. To address this issue, we introduce RayFormer, a camera-ray-inspired query-based 3D object detector that aligns the initialization and feature extraction of object queries with the optical characteristics of cameras. Specifically, RayFormer transforms perspective-view image features into bird's eye view (BEV) via the lift-splat-shoot method and segments the BEV map into sectors based on the camera rays. Object queries are uniformly and sparsely initialized along each camera ray, so that different queries project onto different areas of the image and extract distinct features. In addition, we leverage instance information from the images to supplement the uniformly initialized object queries, placing additional queries along the rays derived from 2D object detection boxes. To extract unique object-level features tailored to each query, we design a ray sampling method that appropriately organizes the distribution of feature sampling points on both the images and the bird's eye view. Extensive experiments are conducted on the nuScenes dataset to validate our ray-inspired model design. The proposed RayFormer achieves 55.5% mAP and 63.4% NDS.
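To make the ray-based query initialization concrete, below is a minimal sketch of placing object queries uniformly and sparsely along camera rays in BEV, assuming a single camera at the BEV origin. All names (num_rays, points_per_ray, max_range, fov) are illustrative assumptions, not identifiers from the paper.

```python
# Hypothetical sketch of ray-based query initialization in BEV.
# Each camera ray contributes queries at distinct depths, so different
# queries project onto different image regions and sample distinct features.
import numpy as np

def init_ray_queries(num_rays: int = 64,
                     points_per_ray: int = 16,
                     max_range: float = 51.2,
                     fov: float = np.pi / 2) -> np.ndarray:
    """Return (num_rays * points_per_ray, 2) BEV (x, y) query positions."""
    angles = np.linspace(-fov / 2, fov / 2, num_rays)     # ray directions
    depths = np.linspace(1.0, max_range, points_per_ray)  # radial offsets
    # Outer product of directions and depths: one query per (ray, depth) pair.
    xs = np.outer(np.cos(angles), depths).reshape(-1)
    ys = np.outer(np.sin(angles), depths).reshape(-1)
    return np.stack([xs, ys], axis=1)

queries = init_ray_queries()
print(queries.shape)  # (1024, 2)
```

In this sketch, spacing queries by depth along each ray avoids the ambiguity described above, where queries near the same ray would otherwise sample near-identical pixels.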
Primary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work on multi-camera 3D object detection is highly relevant to multimodal processing. Object detection and feature extraction from multiple cameras are essential for understanding multimedia content in applications such as video surveillance, autonomous driving, and virtual and augmented reality. In particular, initializing queries in 3D space and projecting them onto different perspective images for feature sampling provides a unified treatment of multiple viewpoints, which can significantly improve the interpretation of complex multimodal data. This technique not only offers a richer, more comprehensive spatial context but also helps to overcome problems associated with viewpoint changes, occlusion, and scale variation that are common in multimodal content. Therefore, the contributions of this research not only advance techniques in 3D object detection but also broaden the feasibility and performance of multimodal applications.
Supplementary Material: zip
Submission Number: 2572