GAN2Shape - Create 3D Shape with 2D GANs

In this blog post, we will discuss the key points of the paper “Do 2D GANs Know 3D Shape? Unsupervised 3D Shape Reconstruction from 2D Image GANs” (GAN2Shape) by Pan et al. We will cover both the theory and the code (in the authors’ GitHub repository), and use a demo Colab notebook to show how GAN2Shape transforms a 2D image into a 3D shape represented as a set of multi-view images.

The GAN2Shape paper presents the first attempt to directly mine 3D geometric cues from GANs trained on 2D RGB images. After reconstructing 3D shapes from 2D images, the technique enables interesting real-world applications such as image editing with relighting and object rotation.

Previous attempts at 3D reconstruction using GANs have a number of limitations, such as requiring 2D keypoint or 3D annotations, heavy memory consumption due to explicitly modeling the 3D representation and rendering during training, lower 3D image generation quality than their 2D counterparts, or assumptions such as object shapes being symmetric.

The GAN2Shape paper was the first attempt to reconstruct 3D object shapes using GANs pretrained on 2D images only, without relying on the symmetry assumption about object shapes. It is able to generate highly photo-realistic 3D-aware image manipulations, such as rotation and relighting, without using external 3D models.

Relevant Topics

Before diving into the tutorial of the GAN2Shape paper, let’s first briefly review a few relevant concepts in case you are unfamiliar with them: 3D deep learning, GANs, StyleGAN2 and Unsup3D.

3D Deep Learning

Using deep learning models to analyze or synthesize 3D data is an interesting area with a wide range of applications, such as 3D art, self-driving cars, virtual reality and augmented reality.

If you are new to this topic, watch the 3D Deep Learning Tutorial by SU Lab at UCSD for a great overview, which is briefly summarized below.

Regular images are typically represented as 2D pixel arrays, while 3D data can take several different representation formats:

  • Multi-view images, captured by positioning multiple cameras that photograph the same object or scene from different angles.
  • Volumetric representations made of voxels, the 3D equivalent of pixels in 2D images.
  • Point clouds, which consist of sets of points collected from raw sensors.
  • Meshes, which consist of a set of vertices plus the connectivity between them.

Common 3D deep learning tasks include image classification, which is largely solved in both 2D and 3D, as well as 3D object detection, segmentation and reconstruction; using a GAN for 3D reconstruction is one of the options for the latter.

There are different options for 3D reconstruction: from a single view or from multiple views. The GAN2Shape paper discussed in this blog post reconstructs 3D shape from a single-view image by leveraging the power of GANs.

GANs

Generative Adversarial Networks (GANs) are a type of generative model that can be used for image synthesis, among many other tasks such as super resolution and image colorization. A GAN consists of two networks: a generator and a discriminator. We feed random noise to the generator, which generates images that resemble the training images, while the discriminator tries to distinguish the generated fake images from the real (training) images. The generator improves the image quality using the feedback from the discriminator.
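
To make the generator-discriminator interplay concrete, here is a minimal PyTorch sketch of one GAN training iteration on toy tensors; the tiny MLP models, sizes and learning rates are illustrative stand-ins and are not part of GAN2Shape.

import torch
import torch.nn as nn

latent_dim, img_dim = 16, 64            # toy sizes, not from the paper
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, img_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(img_dim, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, img_dim)          # stand-in for a batch of training images
z = torch.randn(32, latent_dim)         # random noise fed to the generator

# 1) discriminator step: tell real images apart from generated ones
fake = G(z).detach()
loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
opt_d.zero_grad()
loss_d.backward()
opt_d.step()

# 2) generator step: try to fool the discriminator
loss_g = bce(D(G(z)), torch.ones(32, 1))
opt_g.zero_grad()
loss_g.backward()
opt_g.step()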

There have been many GAN variants since 2014, aiming to improve either the training stability or the quality of the generated images, or both.

StyleGAN2

StyleGAN2 (Karras et al.) introduced an improved GAN architecture that makes use of a style-based generator. This generator first maps the input latent code to an intermediate latent space, which is then injected into the generator to adjust the “style” of the generated image. These modifications enable the generator to separate high-level attributes from low-level ones, improving the interpolation properties as well as the overall image quality.

The StyleGAN2 architecture also facilitates GAN inversion, i.e., projecting an image into the latent space such that it can be converted back to the original sample with minimal reconstruction loss. This conversion and reconstruction process reveals the underlying mechanism of the generator, which allows us to manipulate the attributes of the original image. The paper GAN Inversion: A Survey provides a good analysis of how GAN inversion works and its applications. GAN2Shape makes use of a pretrained StyleGAN2 generator for GAN inversion and the discriminator for computing reconstruction loss.
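
As a rough illustration of GAN inversion, the sketch below optimizes a latent code so that a frozen stand-in generator reproduces a target image. A real inversion pipeline would use a pretrained StyleGAN2 generator (or a trained encoder, as GAN2Shape does) and typically a perceptual loss; everything here is a simplified assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-in generator; in practice this would be a pretrained StyleGAN2 generator
latent_dim = 16
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 3 * 32 * 32), nn.Tanh())
for p in G.parameters():
    p.requires_grad_(False)             # the generator stays frozen during inversion

target = torch.rand(1, 3 * 32 * 32)     # the image we want to project into latent space
w = torch.zeros(1, latent_dim, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.05)

for step in range(200):                 # optimize the latent code to reconstruct the target
    opt.zero_grad()
    loss = F.mse_loss(G(w), target)     # pixel loss; perceptual losses are common in practice
    loss.backward()
    opt.step()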

Unsup3D

Unsup3D is an unsupervised 3D reconstruction model proposed by Shangzhe Wu et al. at CVPR 2020 (Best Paper Award). It uses four individual networks to decompose a single 2D image into view, lighting, depth and albedo factors, under the assumption that objects are symmetric. GAN2Shape adopts the Unsup3D architecture and improves upon it by making use of StyleGAN2, which we will discuss in more detail below.

How GAN2Shape Works

Now that you have the background info on 3D deep learning and GAN variants related to the GAN2Shape paper, let’s take a look at how GAN2Shape works.

The complex architecture and training of GAN2Shape can be broken down into three steps and we will be explaining the theory behind each of them. In addition, we will walk through its code implementation and add links to all the important modules and functions from the official GAN2Shape GitHub repository.

Step 1: Creating Pseudo Samples

The first step in the GAN2Shape model architecture is the generation of pseudo samples. In this step, an input 2D image is passed to four networks: view (V), light (L), depth (D) and albedo (A). Using the outputs of these networks, we reconstruct a set of 2D images with different viewpoints and lighting conditions, referred to as pseudo samples. This method of recovering 3D shape from a single-view 2D image was introduced in Unsup3D by Shangzhe Wu et al., as mentioned above.

Step 1: Four networks V, L, D and A, initialized with a convex shape prior, generate a set of pseudo samples.

Decomposing a 2D image into the four above-mentioned factors is an ill-posed problem, so we have to make an assumption in order to solve it. The assumption is that objects such as faces and cars have a roughly convex shape, and this convex shape prior provides a hint on the initial viewpoint and lighting conditions.

To implement this, the depth map is first initialized with an ellipsoid shape. The functions that predict depth, albedo, viewpoint and lighting are implemented as individual neural networks: depth and albedo are generated by encoder-decoder networks, while viewpoint and lighting are predicted by simple encoder networks.
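
As a rough illustration of the ellipsoid depth prior, the snippet below builds an ellipsoid-shaped depth map on a pixel grid; the resolution and radii are arbitrary choices, not values taken from the paper or the repository.

import torch

H = W = 64                                        # arbitrary resolution
y, x = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
r2 = (x / 0.8) ** 2 + (y / 0.9) ** 2              # arbitrary ellipse radii
# height of an ellipsoid above the image plane, flat (zero) outside the ellipse
depth_prior = torch.sqrt(torch.clamp(1.0 - r2, min=0.0))
print(depth_prior.shape)                          # torch.Size([64, 64])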

In Step 1 we only train the albedo network. To optimize it, we reconstruct the original input image from these four factors via a rendering process, and a reconstruction loss is calculated as a weighted combination of L1 loss and the perceptual loss introduced by Johnson et al.
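
A minimal sketch of such a weighted L1-plus-perceptual reconstruction loss is shown below, using VGG16 features in the spirit of Johnson et al.; the chosen layer and the weight lambda_p are illustrative and do not reproduce the exact settings of the repository.

import torch
import torch.nn.functional as F
from torchvision import models

# frozen VGG16 feature extractor (downloads ImageNet weights on first use)
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def reconstruction_loss(recon, target, lambda_p=1.0):
    l1 = F.l1_loss(recon, target)                 # pixel-level L1 term
    perc = F.l1_loss(vgg(recon), vgg(target))     # feature-level (perceptual) term
    return l1 + lambda_p * perc

recon = torch.rand(2, 3, 64, 64, requires_grad=True)   # rendered reconstruction (toy)
target = torch.rand(2, 3, 64, 64)                       # original input image (toy)
reconstruction_loss(recon, target).backward()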

To create pseudo samples, we randomly sample different lighting directions and viewpoints and combine them with the depth and albedo outputs we have obtained. If the input is a 2D image of a face, the pseudo samples will be a set of images that show how the lighting changes as the face is rotated to different angles.
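
To give a flavor of how new lighting conditions can be sampled, here is a simplified Lambertian shading sketch: surface normals are derived from a depth map and shaded under a randomly sampled light direction (the viewpoint change, which requires the neural renderer, is omitted). All names and constants here are illustrative.

import torch
import torch.nn.functional as F

def normals_from_depth(depth):                    # depth: (B, 1, H, W)
    dzdx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dzdy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dzdx = F.pad(dzdx, (0, 1, 0, 0))              # pad back to the original size
    dzdy = F.pad(dzdy, (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

depth = torch.rand(1, 1, 64, 64)                  # toy depth and albedo maps
albedo = torch.rand(1, 3, 64, 64)
normals = normals_from_depth(depth)

light_dir = F.normalize(torch.randn(1, 3, 1, 1), dim=1)   # randomly sampled light direction
ambient, diffuse = 0.3, 0.7                                # illustrative shading coefficients
shading = ambient + diffuse * (normals * light_dir).sum(dim=1, keepdim=True).clamp(min=0)
pseudo_sample = albedo * shading                           # one re-lit "pseudo sample"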

Code

The Encoder class defines the architecture for the view and light networks, and the EDDeconv class defines the albedo and depth networks. All the processes discussed in this step are executed inside the forward_step1 function. First, the input image is passed to the four networks to compute the respective factors; then, using the Neural 3D Mesh Renderer defined in the Renderer class, the original input is reconstructed. The photometric_loss function is then called to calculate the L1 loss, and the PerceptualLoss class is used to calculate the perceptual loss.

Step 2: Obtain Projected Samples

The pseudo samples we have at this point are useful: they show different viewpoints of the object and indicate how changes in lighting affect the image. However, they also contain unnatural shadows and distortions, so our next step is to transform them into photorealistic images.

This is where StyleGAN2 comes into the picture. In Step 2, we use a pretrained StyleGAN2 generator for GAN inversion and a pretrained StyleGAN2 discriminator for calculating the reconstruction loss in order to optimize the encoder network.

Step 2: Conversion of pseudo samples to projected samples using GAN inversion. An encoder and a StyleGAN2 generator are used for GAN inversion, and a discriminator network calculates the loss used to optimize the encoder network.

We perform GAN inversion on these pseudo samples, converting each sample into a latent vector using a standard ResNet encoder. These latent vectors are then mapped back to image space using the StyleGAN2 generator. In this way we project the pseudo samples onto the GAN image manifold, making them more photorealistic; these new samples are termed projected samples.

While performing GAN inversion, the latent vectors obtained for the pseudo samples are added to the latent representation of the original input. This way we can make the generated images look much more realistic without actually changing other features such as face orientation and shading.

To measure the difference between the generated projected samples and the input pseudo samples, we make use of a discriminator network similar to the one in the StyleGAN2 architecture. Both the generated and the original sets of images are passed through the discriminator, and the distance between the resulting features, together with a regularization term, is used as the reconstruction loss for this step. This method was proposed by Pan et al. The reconstruction loss further ensures that the generated samples will not have lighting conditions and viewpoints that differ from those of the pseudo samples.

$ \theta_{E} = \arg\min_{\theta_{E}} \frac{1}{m} \sum_{i=0}^{m} L'(I_{i}, G(E(I_{i}) + w)) + \lambda \lVert E(I_{i}) \rVert_{2} $

The reconstruction objective for the encoder is given in Eq. 2, where G is the StyleGAN2 generator, E is the encoder, and L' is the distance metric used to measure the loss between the generated and original input features. A regularization term is added to prevent the latent offset from growing too large.
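
Below is a small code sketch of this objective, with toy stand-in modules in place of the ResNet encoder and the pretrained StyleGAN2 generator and discriminator; the feature distance, regularization weight and tensor sizes are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

def step2_encoder_loss(encoder, generator, disc_features, pseudo, w, lam=0.01):
    offset = encoder(pseudo)                       # latent offset E(I_i) predicted from a pseudo sample
    projected = generator(offset + w)              # project back through the (frozen) generator
    # distance between discriminator features of pseudo and projected samples (the L' term)
    dist = F.l1_loss(disc_features(projected), disc_features(pseudo))
    reg = lam * offset.norm(dim=1).mean()          # keep the latent offset small
    return dist + reg

# toy stand-ins just to exercise the function; the real models are a ResNet
# encoder and the pretrained StyleGAN2 generator / discriminator
latent_dim = 8
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, latent_dim))
generator = nn.Sequential(nn.Linear(latent_dim, 3 * 16 * 16), nn.Unflatten(1, (3, 16, 16)))
disc_features = nn.Sequential(nn.Flatten(), nn.Linear(3 * 16 * 16, 32))

pseudo = torch.rand(4, 3, 16, 16)                  # a batch of pseudo samples
w = torch.zeros(4, latent_dim)                     # latent code of the original input image
step2_encoder_loss(encoder, generator, disc_features, pseudo, w).backward()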

Code

All the processes discussed in this step are implemented inside the forward_step2 function. The architectures of the generator and discriminator are defined in the Generator and Discriminator classes respectively. First, the gan_invert function is called to perform GAN inversion; the encoder used here is defined in the ResEncoder class. After this we obtain the projected samples and then calculate the reconstruction loss: the photometric_loss function computes the L1 loss, and the DiscriminatorLoss class extracts image features with the discriminator network, which are then used to compute the total reconstruction loss.

Step 3: From 2D to 3D

After Step 2 we have the projected samples: a set of photorealistic images of a particular object under multiple viewpoints and lighting conditions. To learn its 3D shape, we again make use of the four networks (view, light, depth and albedo) used in Step 1.

The main differences in Step 3 are:

  • The viewpoint and lighting conditions are predicted from the projected samples generated in Step 2; the V and L encoder networks take the projected samples as input.
  • The reconstruction loss is computed differently from Step 1, and the loss is used to optimize all four networks.
  • For generating the 3D view, the viewpoint and lighting obtained from the projected samples are used for rendering, and the images generated at this stage are more photorealistic and better represent the 3D view of the input image.

Step 3: Generating the 3D view using the four networks V, L, D and A. Viewpoint and lighting are predicted from the projected samples, and the reconstruction loss is used to optimize all four networks.

In Step 3, the albedo and depth factors are first predicted from the original input by the respective encoder-decoder networks, while the viewpoint and lighting are predicted from the projected samples generated in Step 2 by the corresponding encoder networks. All four networks are jointly trained to reconstruct the original image with the reconstruction objective formulated in Eq. 3, where I and Ĩ represent the original input image and the projected samples respectively. A smoothness loss is also added to overcome gradient locality, as proposed by Zhou et al.

$\theta_{D},\theta_{A},\theta_{V},\theta_{L} = \arg\min_{\theta_{D},\theta_{A},\theta_{V},\theta_{L}} \frac{1}{m} \sum_{i=0}^{m} L(\tilde{I}_{i}, \Phi(D(I), A(I), V(\tilde{I}_{i}), L(\tilde{I}_{i}))) + \lambda_{2} L_{smooth}(D(I))$
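
As a rough companion to Eq. 3, the snippet below sketches a simple depth smoothness term based on second-order differences (in the spirit of Zhou et al.); the exact formulation and the weight lambda_2 used in the repository may differ.

import torch

def depth_smoothness_loss(depth):                 # depth: (B, 1, H, W)
    # second-order differences penalize abrupt changes in the depth map
    dx = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dy = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dx2 = (dx[:, :, :, 1:] - dx[:, :, :, :-1]).abs().mean()
    dy2 = (dy[:, :, 1:, :] - dy[:, :, :-1, :]).abs().mean()
    return dx2 + dy2

depth = torch.rand(2, 1, 64, 64, requires_grad=True)
lambda_2 = 0.01                                   # illustrative weight
# total_step3_loss = reconstruction_loss(rendered, projected) + lambda_2 * depth_smoothness_loss(depth)
(lambda_2 * depth_smoothness_loss(depth)).backward()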

These four networks are then used to render the 3D view of the input image, similar to how the pseudo samples were generated in Step 1, and the resulting rendered images are much more photorealistic in representing the 3D view than the pseudo samples.

Code

The processes involved in Step 3 are implemented inside the forward_step3 function. The depth and albedo encoder-decoder networks make predictions the same way as in Step 1, whereas the view and light encoders take the projected samples as input for predicting their factors. The same Renderer used in Step 1 reconstructs the original input, and the photometric_loss function and the PerceptualLoss class calculate the reconstruction loss, which is used to optimize the four networks.

Iterative self-refinement

One important note about the GAN2Shape model is that training does not go through the three steps above just once: the steps are repeated over multiple cycles to progressively refine the 3D shape. The paper uses four such cycles (or stages).

Another note about the process: the GAN2Shape model is trained separately for each 2D input image. Training an instance means repeating the three steps discussed above for four cycles (or stages), and the resulting output contains a set of images that can be used to construct a multi-view 3D representation.
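
The overall schedule can be summarized by the skeleton below. The step functions are empty placeholders for the three phases discussed above (the real logic lives in forward_step1/2/3 in the repository), and the stage and iteration counts are illustrative assumptions.

# Instance-specific training schedule (skeleton only; step bodies are placeholders).
def run_step1(image):                 # train the albedo net, build pseudo samples (Step 1)
    pass

def run_step2(image):                 # GAN inversion of pseudo samples -> projected samples (Step 2)
    pass

def run_step3(image):                 # jointly refine the D, A, V, L networks (Step 3)
    pass

def train_instance(image, n_stages=4, iters_per_step=(700, 700, 600)):  # illustrative counts
    for stage in range(n_stages):     # the paper repeats the three steps for several stages
        for _ in range(iters_per_step[0]):
            run_step1(image)
        for _ in range(iters_per_step[1]):
            run_step2(image)
        for _ in range(iters_per_step[2]):
            run_step3(image)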

GAN2Shape Colab Demo

We have discussed how GAN2Shape works in theory and how each step relates to the code in the authors’ repository. Now let’s use this Colab notebook to demonstrate how GAN2Shape works and discuss some of the output results. Please refer to the notebook for the details of the code implementation; here we only walk through the key steps.

Dependencies

First we need to install the dependencies, such as torch, torchvision, mmcv and a PyTorch neural renderer. Note in particular the dependency on the neural renderer, which is used for the rendering process.
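
In the Colab notebook this amounts to a few pip install cells, roughly like the ones below; the exact package names, versions and any prebuilt CUDA wheels pinned by the notebook may differ, so treat these as an approximation.

!pip install mmcv
# the neural renderer is typically installed via pip (neural_renderer_pytorch) or built
# from source, depending on the CUDA / PyTorch version of the Colab runtime
!pip install neural_renderer_pytorch
# torch and torchvision usually come preinstalled in Colab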

Clone the repo

We clone the official GAN2Shape repo and change to the repo/ directory:

!git clone https://github.com/XingangPan/GAN2Shape.git repo
%cd repo/

Download training data and pretrained weights

Run the download.sh shell script to download the repo release files (data.tar.gz, X00, X01, etc.), which contain the training data and pretrained weights.

!sh /content/repo/scripts/download.sh

After this step you will notice a data/ folder with subfolders such as car, cat and celeba; the celeba folder contains 400 images from the CelebA dataset.

Datasets: sample images from the downloaded data folders.

Run scripts

Then we use the run.py script to train the GAN2Shape model on the CelebA data:

!python run.py --launcher none --config configs/celeba.yml 2>&1 | tee results/celeba/log.txt

Note that the authors of the paper experimented with GAN2Shape on four different datasets, while here we only try it on CelebA.

Results

The GAN2Shape model successfully recovers 3D shapes from 2D images of various objects such as cars, buildings, human faces and cats. Some of the results obtained by running the scripts are shown below.

Prior to GAN2Shape, Unsup3D was the state-of-the-art model for obtaining a 3D view from an input 2D image, but it assumed every object to be symmetric, and the lighting and textures were added to the 3D shape accordingly. The GAN2Shape results show that the model succeeds even without the symmetry assumption and produces a more realistic 3D view. The image below compares the performance of the two models.

Comparison between GAN2Shape and Unsup3D.

GAN2Shape works well for images of human or cat faces, where a convex shape prior provides a hint on the viewpoint and lighting conditions, but it fails when this is not the case. Because of this, GAN2Shape was observed to perform poorly on the LSUN horse dataset.

Limitations: some examples where the convex shape prior does not hint at the view and lighting of the object, causing GAN2Shape to perform poorly.

Summary

In this post, we gave a brief introduction to 3D deep learning, GANs, StyleGAN2 and Unsup3D. Then we discussed in detail the key steps of GAN2Shape and how it transforms a 2D image into a 3D shape by combining Unsup3D and StyleGAN2. We demonstrated the training process with a Colab notebook to show the input image and the generated 3D views.

In summary, GAN2Shape is able to generate 3D shapes from readily available 2D images without any additional annotations, 3D models or assumptions of object symmetry, and it provides better results than previous GAN-based 3D reconstruction models.

References