Keywords: scene understanding, representation learning, multi-object scene decomposition, pose estimation, shape and appearance estimation
Abstract: Representing scenes at the granularity of objects is a prerequisite for scene understanding and decision making. We propose a novel approach for learning multi-object 3D scene representations from images. A recurrent encoder regresses a latent representation of the 3D shape, pose and texture of each object from an input RGB image. The 3D shapes are represented continuously in function space as signed distance functions (SDFs), which we efficiently pre-train from example shapes. Using differentiable rendering, we train our model in a self-supervised way to decompose scenes from RGB-D images. Our approach learns to decompose images into the constituent objects of the scene and to infer their shape, pose and texture properties from a single view. In experiments, we evaluate the accuracy of our model in inferring the 3D scene layout and demonstrate the capabilities of the generative 3D scene model.
One-sentence Summary: We propose a model that learns representations of scenes composed of multiple objects; these representations explicitly describe the underlying 3D geometry by encoding the pose, 3D shape and texture of each object.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Supplementary Material: zip
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2010.04030/code)
Reviewed Version (pdf): https://openreview.net/references/pdf?id=WdFdfZdzSe