Inferring 3D Occupancy Fields through Implicit Reasoning on Silhouette Images

Published: 20 Jul 2024, Last Modified: 21 Jul 2024MM2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Abstract: Implicit 3D representations have shown great promise in deep learning-based 3D reconstruction. With differentiable renderers, current methods are able to learn implicit occupancy fields without 3D supervision by minimizing the error between the images rendered from the learned occupancy fields and 2D ground truth images. In this paper, however, we hypothesize that a full rendering pipeline including visibility determination and evaluation of a shading model is not required for the learning of 3D shapes without 3D supervision. Instead, we propose to use implicit reasoning, that is, we reason directly on the implicit occupancy field without explicit rendering. This leads our method to reveal highly accurate 3D structures from low quality silhouette images. Our implicit reasoning infers a 3D occupancy field by evaluating how well it matches with multiple 2D occupancy maps, using occupancy clues rather than rendering the 3D occupancy field into images. We exploit the occupancy clues that indicate whether a viewing ray inside a 2D object silhouette hits at least one occupied 3D location, or whether a ray outside the silhouette hits no occupied location. In contrast to differentiable renderers whose losses do not distinguish between the inside and outside of objects, our novel loss function weights unoccupied clues more than occupied ones. Our results outperform recent state-of-the-art techniques, justifying that we can learn accurate occupancy fields only using sparse clues without an explicit rendering process.
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Generation] Generative Multimedia
Relevance To Conference: Implicit functions have emerged as an important 3D representation in computer vision and multimedia domains. They enable deep neural networks to represent 3D shapes in a discriminative manner by learning mappings from 3D locations to their occupancy labels (or signed distance values). For learning with 3D supervision, 3D locations with known occupancy labels are sampled densely around 3D ground truth shapes, which are used as training samples.
Supplementary Material: zip
Submission Number: 5447
Loading