RIC: Rotate-Inpaint-Complete for Generalizable Scene Reconstruction

Published: 05 Apr 2024 · Last Modified: 14 Apr 2024 · VLMNM 2024 · CC BY 4.0
Keywords: RGB-D Perception; Deep Learning for Visual Perception; Visual Language Models; Robotics; Computer Vision
TL;DR: Text-to-image models pre-trained on large amounts of image data can be used to perform generalizable 3D scene reconstruction from a single viewpoint.
Abstract: General scene reconstruction refers to the task of estimating the full 3D geometry and texture of a scene containing previously unseen objects. In many practical applications such as AR/VR, autonomous navigation, and robotics, only a single view of the scene may be available, making scene reconstruction challenging. In this paper, we present a method for scene reconstruction that structurally breaks the problem into two steps: rendering novel views via inpainting, and 2D-to-3D scene lifting. Specifically, we leverage the generalization capability of large visual language models (DALL·E 2) to inpaint the missing areas of scene color images rendered from different views. Next, we lift these inpainted images to 3D by predicting the normals of the inpainted image and solving for the missing depth values (one way to pose this solve is sketched below). By predicting normals instead of depth directly, our method is robust to changes in depth distributions and scale. With rigorous quantitative evaluation, we show that our method outperforms multiple baselines while generalizing to novel objects and scenes. Code and data links can be found at https://samsunglabs.github.io/RIC-project-page/.
Supplementary Material: zip
Submission Number: 4
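
The normals-to-depth step in the abstract amounts to integrating a predicted normal map into a depth map while staying consistent with the depths already observed from the input view. Below is a minimal sketch (not the authors' code) of one way to pose this as a sparse anchored least-squares problem; it assumes an orthographic camera, so that dz/dx = -n_x/n_z and dz/dy = -n_y/n_z, and the function name `lift_depth_from_normals` and weight `w_data` are illustrative assumptions, not names from the paper.

```python
# Sketch of 2D-to-3D lifting: recover missing depth from a predicted
# normal map, anchored to the sparse depths observed in the input view.
# Assumes an orthographic camera and camera-facing normals (n_z > 0);
# the paper's actual formulation may differ.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def lift_depth_from_normals(normals, known_depth, known_mask, w_data=1.0):
    """normals: (H, W, 3) unit normals; known_depth: (H, W) partial depth;
    known_mask: (H, W) bool, True where depth is observed."""
    H, W = known_mask.shape
    N = H * W
    nz = np.clip(normals[..., 2], 1e-3, None)  # avoid division by ~0
    p = -normals[..., 0] / nz                  # target gradient dz/dx
    q = -normals[..., 1] / nz                  # target gradient dz/dy

    idx = np.arange(N).reshape(H, W)
    rows, cols, vals, rhs = [], [], [], []
    r = 0
    # Horizontal gradient equations: z[y, x+1] - z[y, x] = p[y, x].
    for y in range(H):
        for x in range(W - 1):
            rows += [r, r]; cols += [idx[y, x + 1], idx[y, x]]
            vals += [1.0, -1.0]; rhs.append(p[y, x]); r += 1
    # Vertical gradient equations: z[y+1, x] - z[y, x] = q[y, x].
    for y in range(H - 1):
        for x in range(W):
            rows += [r, r]; cols += [idx[y + 1, x], idx[y, x]]
            vals += [1.0, -1.0]; rhs.append(q[y, x]); r += 1
    # Data terms: anchor the solution to the observed depth values.
    for y, x in zip(*np.nonzero(known_mask)):
        rows.append(r); cols.append(idx[y, x])
        vals.append(w_data); rhs.append(w_data * known_depth[y, x]); r += 1

    A = sp.csr_matrix((vals, (rows, cols)), shape=(r, N))
    z = lsqr(A, np.asarray(rhs))[0]
    return z.reshape(H, W)
```

A perspective camera or a different objective would change the gradient targets, but the core idea is the same: the normal map constrains relative depth everywhere, and the observed depths fix the absolute offset and scale, which is one reason predicting normals is less sensitive to depth-distribution shift than predicting depth directly.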