Abstract: Modern autonomous vehicles rely heavily on mechanical LiDARs for perception. Current perception methods generally require 360° point clouds, collected sequentially as the LiDAR scans the azimuth and acquires consecutive wedge-shaped slices. The acquisition latency of a full scan (~100 ms) may lead to outdated perception, which is detrimental to safe operation. Recent streaming perception works have proposed directly processing LiDAR slices and compensating for the narrow field of view (FoV) of a slice by reusing features from preceding slices. These works, however, are all based on a single modality and require past information that may be outdated. Meanwhile, images from high-frequency cameras can support streaming models, as they provide a larger FoV than a LiDAR slice. However, this difference in FoV complicates sensor fusion. We propose a camera-LiDAR streaming 3D object detection framework that uses camera images instead of past LiDAR slices to provide an up-to-date, dense, and wide context for streaming perception. The proposed method outperforms prior streaming models and strong full-scan baselines on the challenging NuScenes benchmark in both detection accuracy and end-to-end runtime. Our method is shown to be robust to missing camera images, narrow LiDAR slices, and small camera-LiDAR miscalibration.
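For a concrete picture of the streaming setup described above, the sketch below shows one way such a slice-by-slice loop could be organized: each incoming wedge-shaped LiDAR slice is paired with the most recent camera image and passed to a fusion detector, rather than being padded with features from past slices. This is only an illustrative outline under our own assumptions; the names `FusionDetector` and `streaming_loop` are hypothetical and do not come from the paper.

```python
# Minimal sketch of the slice-by-slice fusion loop implied by the abstract.
# All names here (FusionDetector, streaming_loop) are hypothetical
# placeholders for illustration, not the paper's actual components.
import numpy as np


class FusionDetector:
    """Stand-in for a camera-LiDAR streaming detector (hypothetical)."""

    def detect(self, slice_points: np.ndarray, image: np.ndarray) -> list:
        # A real model would fuse image features (wide, dense context) with
        # point features from the narrow wedge-shaped slice; this stub
        # simply returns an empty list of detections.
        return []


def streaming_loop(lidar_slices, camera_frames, detector):
    """Process each LiDAR slice as soon as it arrives, pairing it with the
    most recent camera image rather than with features from past slices.

    lidar_slices:  iterable of (timestamp, points), points of shape (N, 4)
    camera_frames: list of (timestamp, image) sorted by timestamp
    """
    latest_image = None
    for t, points in lidar_slices:
        # Advance the camera stream to the newest frame captured at or before t.
        while camera_frames and camera_frames[0][0] <= t:
            _, latest_image = camera_frames.pop(0)
        if latest_image is None:
            continue  # no image context available yet
        yield t, detector.detect(points, latest_image)
```

Because the camera runs at a higher frame rate than a full LiDAR revolution, the most recent image is typically fresher than any previously accumulated slice features, which is the intuition behind using it as the wide context.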