Abstract: Autonomous driving systems that rely solely on ego-vehicle sensors face critical safety challenges because of their limited field of view and restricted long-range sensing. Roadside cameras offer a complementary paradigm that can mitigate blind spots and extend the perceptual horizon. However, existing roadside detection methods struggle to accurately detect small, distant objects, which are crucial for timely hazard anticipation. In this work, we introduce BEVTemp, a 3D object detector that leverages multi-frame information to construct temporal stereo pairs, enabling enhanced long-range perception. Our key insight is to exploit information about nearby objects from past frames to help detect small, distant targets in the present view. Furthermore, we introduce a dedicated small-object enhancement module that encodes multi-scale features and strengthens localization signals, explicitly improving the detection of smaller classes such as bicycles and pedestrians. Together, these components deliver robust and accurate 3D object detection across the diverse scales and distances found in roadside environments. On the DAIR-V2X-I and Rope3D datasets, BEVTemp achieves state-of-the-art performance, surpassing all previous methods by a significant 4% margin. Our work pioneers the use of multi-frame cues for roadside perception and presents a comprehensive solution for long-range and small-object detection in this domain.