Evaluation of Generative Models: An Empirical Study

TMLR Paper484 Authors

06 Oct 2022 (modified: 17 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Implicit generative models, which do not return likelihood values, such as generative adversarial networks and diffusion models, have become prevalent in recent years. While these models have shown remarkable results, evaluating their performance is challenging. This issue is of vital importance for pushing research forward and distinguishing meaningful gains from random noise. Currently, heuristic metrics such as the Inception Score (IS) and the Fréchet Inception Distance (FID) are the most common evaluation metrics, but what they measure is not entirely clear, and there are questions about how meaningful their scores actually are. In this work, we propose a novel evaluation protocol for likelihood-based generative models, based on generating a high-quality synthetic dataset on which we can estimate classical probabilistic metrics for comparison. Our study shows that while FID and IS do correlate with several f-divergences, their rankings of close models can vary considerably, making them problematic for fine-grained comparison. We further use this experimental setting to study which evaluation metric best correlates with our probabilistic metrics. Lastly, we address some of the issues with the FID score by investigating the features used to compute it.
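For readers unfamiliar with FID: it summarizes real and generated images by the mean and covariance of their Inception features and reports the Fréchet (2-Wasserstein) distance between the two Gaussian fits. The sketch below is a minimal illustration of that formula, not the paper's code; it assumes feature matrices (e.g., from the Inception-v3 pool3 layer) have already been extracted, and the names `frechet_distance` and `fid_from_features` are placeholders.

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between two Gaussians N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Matrix square root of the covariance product; discard tiny imaginary parts.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean)


def fid_from_features(feats_real, feats_gen):
    """FID given (N, D) arrays of Inception features for real and generated images."""
    mu_r, sigma_r = feats_real.mean(axis=0), np.cov(feats_real, rowvar=False)
    mu_g, sigma_g = feats_gen.mean(axis=0), np.cov(feats_gen, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)
```

Because the score depends entirely on the chosen feature extractor, the last part of the paper examines how the features used for FID affect its behavior.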
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. [section 2.1] Added a more intuitive explanation and a figure about the KL divergence and its limitations.
2. [section 3] Added a link to the ImageGPT checkpoints used to generate NotImageNet32.
3. [section 3] Addressed the question of the realism of NotImageNet32 with samples and a quantitative linear-probing score.
4. [section 4] Addressed the standard-deviation error of the experiment.
5. [section 4.2] Added explanations to the table caption.
6. [section 5] Added an explanation to the caption of Figure 7.
7. [appendix B.1] Added training score results for the different models.
8. [appendix D] Added CelebA results as an extensive qualitative analysis.
9. [appendix E] Added sample examples from the generation process, from both PixelSnail and VD-VAE.
Assigned Action Editor: ~Balaji_Lakshminarayanan1
Submission Number: 484