Keywords: Long-Tailed 3D Object Detection, Vision Foundation Model, Multimodal Fusion, Autonomous Vehicles
TL;DR: We propose FOMO-3D, the first 3D detector to leverage vision foundation models (OWLv2 for open-vocabulary 2D object detection and Metric3Dv2 for dense metric depth) with novel multimodal fusion designs to tackle long-tailed 3D object detection.
Abstract: To navigate complex traffic environments, self-driving vehicles
must recognize many semantic classes pertaining to vulnerable road users and
traffic control devices. However, many safety-critical objects (e.g.,
construction workers) appear infrequently under nominal traffic conditions,
leaving a severe shortage of training examples from driving data alone.
Recent vision foundation models, trained on large and diverse corpora,
can serve as a rich source of external prior knowledge to improve
generalization. We propose FOMO-3D, the first 3D detector
to leverage vision foundation models for long-tailed 3D detection. Specifically,
FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within
a two-stage detection paradigm that first generates proposals with a
LiDAR-based branch and a novel camera-based branch, then refines them by
attending to image features from OWLv2. Evaluations on real-world
driving data show that exploiting rich priors from vision
foundation models with careful multimodal fusion designs yields large gains
on long-tailed 3D detection.
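To make the two-stage design concrete, below is a minimal sketch of how a second-stage head might refine proposals by cross-attending to foundation-model image features. All names and dimensions here (ProposalRefinementHead, d_model, the token shapes) are illustrative assumptions, not FOMO-3D's actual implementation.

```python
import torch
import torch.nn as nn

class ProposalRefinementHead(nn.Module):
    """Illustrative second-stage head: pooled 3D proposal features attend
    to 2D image tokens from a frozen vision foundation model (e.g., the
    OWLv2 backbone). A sketch under assumed dimensions, not the paper's
    implementation."""

    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Regress per-proposal box residuals: (x, y, z, l, w, h, yaw).
        self.box_head = nn.Linear(d_model, 7)

    def forward(self, proposal_feats, image_feats):
        # proposal_feats: (B, N, d) features of first-stage proposals
        #                 (from the LiDAR and camera branches)
        # image_feats:    (B, T, d) projected foundation-model image tokens
        attended, _ = self.cross_attn(proposal_feats, image_feats, image_feats)
        fused = self.norm(proposal_feats + attended)
        return self.box_head(fused)

# Toy usage: 4 scenes, 128 proposals, 600 image tokens, 256-d features.
head = ProposalRefinementHead()
refined = head(torch.randn(4, 128, 256), torch.randn(4, 600, 256))
print(refined.shape)  # torch.Size([4, 128, 7])
```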
Supplementary Material: zip
Spotlight: zip
Submission Number: 203