Keywords: monocular 3D object detection, open-vocabulary 3D detection, promptable perception, in-the-wild 3D, 3D detection dataset, human-in-the-loop annotation, depth estimation, spatial intelligence, open-world perception
TL;DR: WildDet3D is a promptable monocular 3D detector (text/point/box prompts, optional depth input) paired with WildDet3D-Data, a 1M-image, 13.5K-category human-in-the-loop dataset. It sets a new state of the art on Omni3D, on zero-shot Argoverse 2 and ScanNet, and on our 700+ in-the-wild benchmark.
Abstract: Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection: recovering the extent, location, and orientation of objects from a single input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks. First, existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues. Second, current 3D datasets cover only narrow categories in controlled environments, which limits open-world transfer.
In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed through a human-in-the-loop pipeline in which crowd annotators select and rate the best 3D candidate per object, complemented by a VLM scorer aligned to these human judgments; this yields over 1M images across 13.5K categories in diverse real-world scenes.
WildDet3D establishes a new state of the art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6 / 24.8 $AP_{3D}$ on our newly introduced WildDet3D-Bench with text and box prompts, respectively. On Omni3D, it reaches 34.2 / 36.4 $AP_{3D}$ with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3 / 48.9 ODS on Argoverse 2 and ScanNet, respectively. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 25