An empirical study on evaluation metrics of generative adversarial networks

15 Feb 2018, 21:29 (modified: 10 Feb 2022, 11:31) · ICLR 2018 Conference Blind Submission
Keywords: generative adversarial networks, evaluation metric
Abstract: Despite the widespread interest in generative adversarial networks (GANs), few works have studied the metrics that quantitatively evaluate GANs' performance. In this paper, we revisit several representative sample-based evaluation metrics for GANs, and address the important problem of \emph{how to evaluate the evaluation metrics}. We start with a few necessary conditions for metrics to produce meaningful scores, such as distinguishing real from generated samples, identifying mode dropping and mode collapsing, and detecting overfitting. Then with a series of carefully designed experiments, we are able to comprehensively investigate existing sample-based metrics and identify their strengths and limitations in practical settings. Based on these results, we observe that kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbour (1-NN) two-sample test seem to satisfy most of the desirable properties, provided that the distances between samples are computed in a suitable feature space. Our experiments also unveil interesting properties about the behavior of several popular GAN models, such as whether they are memorizing training samples, and how far these state-of-the-art GANs are from perfect.
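The abstract singles out kernel Maximum Mean Discrepancy (MMD) and the 1-Nearest-Neighbour (1-NN) two-sample test as the most reliable metrics. As a rough illustration of how these two tests behave, here is a minimal NumPy sketch on toy Gaussian data; the RBF bandwidth, sample sizes, and raw-coordinate distances are illustrative choices only, not the paper's setup (the paper computes distances in a suitable feature space, e.g. a pretrained convnet's activations):

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of a and rows of b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=1.0):
    # Unbiased estimate of squared kernel MMD between samples x and y.
    # Near 0 when x and y come from the same distribution.
    m, n = len(x), len(y)
    kxx = gaussian_kernel(x, x, sigma)
    kyy = gaussian_kernel(y, y, sigma)
    kxy = gaussian_kernel(x, y, sigma)
    term_x = (kxx.sum() - np.trace(kxx)) / (m * (m - 1))
    term_y = (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

def one_nn_accuracy(x, y):
    # Leave-one-out 1-NN classification accuracy on the pooled sample,
    # labelling x as "real" (0) and y as "generated" (1).
    # ~0.5 when the two samples are indistinguishable; ~1.0 when separable.
    z = np.vstack([x, y])
    labels = np.r_[np.zeros(len(x)), np.ones(len(y))]
    d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)  # a point may not be its own neighbour
    nn = d2.argmin(axis=1)
    return (labels[nn] == labels).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 2))
same = rng.normal(0.0, 1.0, size=(200, 2))  # matches the real distribution
far = rng.normal(3.0, 1.0, size=(200, 2))   # clearly mismatched

print(mmd2_unbiased(real, same), mmd2_unbiased(real, far))
print(one_nn_accuracy(real, same), one_nn_accuracy(real, far))
```

Both diagnostics agree on this toy example: MMD stays near zero and 1-NN accuracy near chance (0.5) for the matching sample, while the shifted sample yields a large MMD and near-perfect 1-NN accuracy.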
Code: [xuqiantong/GAN-Metrics](https://github.com/xuqiantong/GAN-Metrics) · [3 community implementations on Papers with Code](https://paperswithcode.com/paper/?openreview=Sy1f0e-R-)
Data: [ImageNet](https://paperswithcode.com/dataset/imagenet)