- Abstract: Recent advances in conditional image generation tasks, such as image-to-image translation and image inpainting, can be largely attributed to the success of conditional GAN models, which are often optimized by jointly using the GAN loss with a reconstruction loss. However, we show that this training recipe, shared by almost all existing methods, is problematic and has one critical side effect: a lack of diversity in output samples. In order to achieve both training stability and multimodal output generation, we propose novel training schemes with a new set of losses that simply replace the reconstruction loss, and thus are applicable to any conditional generation task. We demonstrate this through thorough experiments on image-to-image translation, super-resolution, and image inpainting tasks using the Cityscapes and CelebA datasets. Quantitative evaluation also confirms that our methods achieve high diversity in outputs while retaining or even improving the quality of images.
- Keywords: conditional GANs, conditional image generation, multimodal generation, reconstruction loss, maximum likelihood estimation, moment matching
- TL;DR: We show that mode collapse in conditional GANs is largely attributable to a mismatch between the reconstruction loss and the GAN loss, and introduce a set of novel loss functions as alternatives to the reconstruction loss.