FOMO-3D: Using Vision Foundation Models for Long-Tailed 3D Object Detection

Anqi Joyce Yang; James Tu; Nikita Dvornik; Enxu Li; Raquel Urtasun

Published: 08 Aug 2025, Last Modified: 16 Sept 2025. Venue: CoRL 2025 Poster. License: CC BY 4.0

Keywords: Long-Tailed 3D Object Detection, Vision Foundation Model, Multimodal Fusion, Autonomous Vehicles

TL;DR: We propose FOMO-3D, the first 3D detector to leverage vision foundation models (specifically OWLv2 for 2D object detection and Metric3Dv2 for dense depth) with novel multimodal fusion designs to tackle long-tailed 3D object detection.

Abstract: To navigate complex traffic environments, self-driving vehicles must recognize many semantic classes pertaining to vulnerable road users or traffic control devices. However, many safety-critical objects (e.g., construction workers) appear infrequently in nominal traffic conditions, leading to a severe shortage of training examples from driving data alone. Recent vision foundation models, which are trained on a large corpus of data, can serve as a good source of external prior knowledge to improve generalization. We propose FOMO-3D, the first 3D detector to leverage vision foundation models for long-tailed 3D detection. Specifically, FOMO-3D exploits rich semantic and depth priors from OWLv2 and Metric3Dv2 within a two-stage detection paradigm that first generates proposals with a LiDAR-based branch and a novel camera-based branch, and then refines them with attention, especially to image features from OWL. Evaluations on real-world driving data show that using rich priors from vision foundation models with careful multimodal fusion designs leads to large gains for long-tailed 3D detection.

Supplementary Material: zip

Spotlight: zip

Submission Number: 203
