Abstract: Learning a 3D scene representation from a single-view image is a long-standing,
fundamental problem in computer vision, made difficult by the inherent ambiguity in predicting
content unseen from the input view. Built on the recently proposed 3D Gaussian
Splatting (3DGS), the Splatter Image method has made promising progress on
fast single-image novel view synthesis via learning a single 3D Gaussian for each
pixel based on the U-Net feature map of an input image. However, it has limited
expressive power to represent occluded components that are not observable in the
input view. To address this problem, this paper presents a Hierarchical Splatter
Image method in which a pixel is worth more than one 3D Gaussian. Specifically,
each pixel is represented by a parent 3D Gaussian and a small number of child
3D Gaussians. Parent 3D Gaussians are learned as done in the vanilla Splatter
Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron
(MLP) which takes as input the projected image features of a parent 3D Gaussian
and the embedding of a target camera view. Both parent and child 3D Gaussians
are learned end-to-end in a stage-wise manner. Conditioning jointly on the input image
features seen from the parent Gaussians and on the target camera position facilitates
learning to allocate child Gaussians to “see the unseen”, recovering the occluded
details that are often missed by parent Gaussians. In experiments, the proposed
method achieves state-of-the-art performance on the ShapeNet-SRN and CO3D
datasets, and shows particular promise in reconstructing content that is occluded
in the input view.
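
The parent/child structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the 14-value Gaussian layout, all layer sizes, the random stand-ins for the U-Net feature map and camera embedding, and the function names `parent_head` and `child_mlp` are assumptions made for illustration. As a further simplification, each pixel's own feature vector stands in for the projected image features of its parent Gaussian.

```python
import numpy as np

# Assumed parameter layout per Gaussian (not from the paper):
# 3 mean + 3 scale + 4 rotation quaternion + 1 opacity + 3 RGB.
GAUSS_DIM = 14


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) for a small randomly initialised linear layer."""
    return rng.normal(0.0, 0.02, size=(n_in, n_out)), np.zeros(n_out)


def parent_head(features, w, b):
    """Per-pixel linear head: (H, W, C) features -> (H, W, GAUSS_DIM) parents."""
    return features @ w + b


def child_mlp(parent_feats, cam_embed, params, num_children):
    """Two-layer MLP mapping [parent feature, camera embedding] to K children."""
    (w1, b1), (w2, b2) = params
    n = parent_feats.shape[0]
    # Every parent is conditioned on the same target-camera embedding.
    cam = np.broadcast_to(cam_embed, (n, cam_embed.shape[0]))
    x = np.concatenate([parent_feats, cam], axis=1)
    hidden = np.maximum(x @ w1 + b1, 0.0)  # ReLU
    out = hidden @ w2 + b2
    return out.reshape(n, num_children, GAUSS_DIM)


rng = np.random.default_rng(0)
H, W, C, E, K, HIDDEN = 8, 8, 16, 6, 2, 32  # hypothetical sizes

features = rng.normal(size=(H, W, C))  # stand-in for the U-Net feature map
cam_embed = rng.normal(size=E)         # stand-in for the target camera embedding

# One parent Gaussian per pixel.
parents = parent_head(features, *init_linear(rng, C, GAUSS_DIM))

# K child Gaussians per parent, conditioned on feature and target camera.
mlp_params = (
    init_linear(rng, C + E, HIDDEN),
    init_linear(rng, HIDDEN, K * GAUSS_DIM),
)
children = child_mlp(features.reshape(-1, C), cam_embed, mlp_params, K)

# The renderer would receive parents and children as one set of Gaussians.
all_gaussians = np.concatenate(
    [parents.reshape(-1, GAUSS_DIM), children.reshape(-1, GAUSS_DIM)], axis=0
)
print(parents.shape, children.shape, all_gaussians.shape)
```

With these sizes, each of the 64 pixels contributes one parent and two children. Because the children depend on the camera embedding, they would be recomputed for each target view while the parents stay fixed.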