3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

20 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: 3D Computer Vision, Image Diffusion Networks, Novel View Synthesis, 3D Object Detection
TL;DR: 3DiffTection introduces a novel method for 3D detection using a 3D-aware diffusion model, bridging the gap between 2D and 3D tasks with specialized tuning and outperforming prior methods on established benchmarks.
Abstract: We present 3DiffTection, a cutting-edge method for 3D detection from posed images, grounded in features from a 3D-aware diffusion model. Annotating large-scale image data for 3D object detection is both resource-intensive and time-consuming. Recently, large image diffusion models have gained traction as potent feature extractors for 2D perception tasks. However, these features, originally trained on paired text and image data, are not directly adaptable to 3D tasks and often misalign with target data. Our approach bridges these gaps through two specialized tuning strategies: geometric and semantic. For geometric tuning, we refine a diffusion model on a view synthesis task, introducing a novel epipolar warp operator. This task meets two pivotal criteria: it requires 3D awareness and relies solely on posed image data, which is readily available (e.g., from videos). For semantic refinement, we further train the model on target data using box supervision. Both tuning phases employ a ControlNet to preserve the integrity of the original feature capabilities. In the final step, we harness these capabilities to ensemble test-time predictions across multiple virtual viewpoints. Through this methodology, we derive 3D-aware features tailored for 3D detection that excel at identifying cross-view point correspondences. Consequently, the resulting model emerges as a powerful 3D detector, substantially surpassing previous methods; e.g., it outperforms Cube-RCNN, a precedent in single-view 3D detection, by 9.43% in AP3D-Near on the Omni3D-ARkitscene dataset. Furthermore, 3DiffTection showcases robust data efficiency and remarkable generalization to cross-domain data.
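The abstract does not detail the epipolar warp operator, but its geometric core is standard two-view epipolar geometry: for a pixel in the target view, corresponding points in a source view are constrained to a line, along which source features can be sampled and aggregated. The NumPy sketch below illustrates only that geometry; the function names, the camera convention (X2 = R @ X1 + t), and the line-sampling scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def skew(v):
    """Cross-product matrix, so that skew(v) @ u == np.cross(v, u)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_matrix(K1, K2, R, t):
    """F such that p2^T F p1 = 0 for homogeneous pixel coords p1, p2.

    Convention (an assumption here): (R, t) maps source-camera coordinates
    to target-camera coordinates, X2 = R @ X1 + t, so E = [t]_x R is the
    essential matrix and F = K2^{-T} E K1^{-1}.
    """
    E = skew(t) @ R
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def sample_epipolar_points(line, width, n=8):
    """Sample n pixel locations on an epipolar line (a, b, c): a*x + b*y + c = 0.

    A feature warp would bilinearly sample source features at these locations
    and aggregate them (e.g., by attention); here we return only the
    coordinates. Assumes the line is not vertical (b != 0).
    """
    a, b, c = line
    xs = np.linspace(0.0, width - 1.0, n)
    ys = -(a * xs + c) / b
    return np.stack([xs, ys], axis=1)
```

Given a target pixel p1 (homogeneous), `F @ p1` yields its epipolar line in the source image, and `sample_epipolar_points` then produces candidate source locations whose features could be warped to p1.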
Supplementary Material: pdf
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2327