TL;DR: We propose a suite of sanity checks for evaluating generative fidelity and diversity metrics, and find that every current metric fails a significant number of them.
Abstract: Any method's development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling underscores the importance of good synthetic data metrics. Unfortunately, previous works have identified many failure cases in current metrics, for example a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, together with a suite of sanity checks: carefully chosen simple experiments designed to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders the practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics rather than models. Additionally, by analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (and should not) be used.
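To make the sanity-check idea concrete, below is a minimal sketch of one such check, probing the outlier-robustness failure mode mentioned above. It is not the paper's protocol: the k-NN precision/recall formulation (in the spirit of Kynkäänniemi et al., 2019), the Gaussian data, the sample sizes, and the outlier placement are all illustrative assumptions.

```python
# Illustrative sanity check (not the paper's exact protocol): does a
# k-NN-based precision/recall metric stay stable when a handful of
# extreme outliers is injected into the real data?
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def knn_radii(x, k):
    # Distance from each point in x to its k-th nearest neighbour (excluding itself).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(x).kneighbors(x)
    return dists[:, -1]

def precision_recall(real, fake, k=5):
    # Simplified k-NN manifold estimate in the spirit of improved
    # precision/recall; for illustration only.
    real_radii = knn_radii(real, k)
    fake_radii = knn_radii(fake, k)
    d = cdist(real, fake)  # pairwise real-to-fake distances
    precision = (d < real_radii[:, None]).any(axis=0).mean()  # fake points covered by the real manifold
    recall = (d < fake_radii[None, :]).any(axis=1).mean()     # real points covered by the fake manifold
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
fake = rng.normal(size=(1000, 8))  # a "perfect" generator: same distribution as real

baseline = precision_recall(real, fake)
real_with_outliers = np.vstack([real, rng.normal(loc=50.0, size=(10, 8))])  # ~1% extreme outliers
perturbed = precision_recall(real_with_outliers, fake)

print("baseline  precision/recall:", baseline)
print("perturbed precision/recall:", perturbed)
```

On identical distributions both scores should be close to 1; a metric that is robust to outliers should move only marginally after the outliers are added, whereas a large shift would flag outlier sensitivity, one of the failure modes the paper's checks target.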
Lay Summary: Replacing data about real people with similar computer-generated synthetic data has several applications in machine learning, including enhancing the privacy of individuals. Evaluating the quality of such synthetic data is challenging, and many evaluation metrics have been developed to measure various aspects of synthetic data quality.
We focus on two types of metrics: fidelity metrics, which aim to evaluate how realistic synthetic data is, and diversity metrics, which aim to evaluate how diverse the synthetic data is compared to the real data. We check how well these metrics actually capture the qualities they aim to measure using very simple scenarios in which it is obvious how a good metric should behave. We find that every currently existing metric fails to behave as it should on many of these checks, and we conclude that all of them are flawed.
The flaws of current metrics mean that synthetic data evaluations relying on them may draw misleading conclusions about synthetic data quality. Results should therefore be interpreted with the known flaws of the metrics in mind, and better metrics should be developed that fix as many of these flaws as possible.
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: M2Y1Z
Link To Code: https://github.com/vanderschaarlab/position-fidelity-diversity-metrics-flawed
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Generative model evaluation, synthetic data
Submission Number: 38