3D Noise and Adversarial Supervision Is All You Need for Multi-modal Semantic Image Synthesis

Vadim Sushko, Edgar Schönfeld, Dan Zhang, Jürgen Gall, Bernt Schiele, Anna Khoreva

2020 (modified: 26 Jul 2022) | ECCV Workshops (6) 2020 | Readers: Everyone

Abstract: Semantic image synthesis models suffer from training instabilities and poor image quality when trained with adversarial supervision alone. Historically, this was alleviated via an additional VGG-based perceptual loss. We propose a new, simplified GAN model that needs only adversarial supervision to achieve high-quality results. In doing so, we also show that VGG supervision decreases image diversity and can hurt image quality. We achieve the improvement by re-designing the discriminator as a semantic segmentation network. The resulting stronger supervision makes the VGG loss obsolete. Moreover, in contrast to previous work, we enable high-quality multi-modal image synthesis through a novel noise sampling scheme. Compared to the state of the art, we achieve an average improvement of 6 FID points and 7 mIoU points.
