TL;DR: We propose a suite of sanity checks for evaluating generative fidelity and diversity metrics, and find that every current metric fails a significant number of them.
Abstract: Any method's development and practical application are limited by our ability to measure its reliability. The popularity of generative modeling underscores the importance of good synthetic data metrics. Unfortunately, previous works have identified many failure cases in current metrics, for example a lack of outlier robustness and unclear lower and upper bounds. We propose a list of desiderata for synthetic data metrics, together with a suite of sanity checks: carefully chosen simple experiments designed to detect specific and known generative modeling failure modes. Based on these desiderata and the results of our checks, we arrive at our position: all current generative fidelity and diversity metrics are flawed. This significantly hinders the practical use of synthetic data. Our aim is to convince the research community to spend more effort on developing metrics rather than models. Additionally, by analyzing how current metrics fail, we provide practitioners with guidelines on how these metrics should (and should not) be used.
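To make the sanity-check idea concrete, below is a minimal sketch of one such check, probing the outlier-robustness failure mode mentioned above. It is not the paper's protocol: the k-NN precision/recall formulation (in the spirit of Kynkäänniemi et al., 2019), the Gaussian data, the sample sizes, and the outlier placement are all illustrative assumptions.

```python
# Illustrative sanity check (not the paper's exact protocol): does a
# k-NN-based precision/recall metric stay stable when a handful of
# extreme outliers is injected into the real data?
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.neighbors import NearestNeighbors

def knn_radii(x, k):
    # Distance from each point in x to its k-th nearest neighbour (excluding itself).
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(x).kneighbors(x)
    return dists[:, -1]

def precision_recall(real, fake, k=5):
    # Simplified k-NN manifold estimate in the spirit of improved
    # precision/recall; for illustration only.
    real_radii = knn_radii(real, k)
    fake_radii = knn_radii(fake, k)
    d = cdist(real, fake)  # pairwise real-to-fake distances
    precision = (d < real_radii[:, None]).any(axis=0).mean()  # fake points covered by the real manifold
    recall = (d < fake_radii[None, :]).any(axis=1).mean()     # real points covered by the fake manifold
    return precision, recall

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))
fake = rng.normal(size=(1000, 8))  # a "perfect" generator: same distribution as real

baseline = precision_recall(real, fake)
real_with_outliers = np.vstack([real, rng.normal(loc=50.0, size=(10, 8))])  # ~1% extreme outliers
perturbed = precision_recall(real_with_outliers, fake)

print("baseline  precision/recall:", baseline)
print("perturbed precision/recall:", perturbed)
```

On identical distributions both scores should be close to 1; a metric that is robust to outliers should move only marginally after the outliers are added, whereas a large shift would flag outlier sensitivity, one of the failure modes the paper's checks target.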
Lay Summary: Replacing data about real people with similar computer-generated synthetic data has several applications in machine learning, including enhancing the privacy of individuals. Evaluating the quality of such synthetic data is challenging, and many evaluation metrics have been developed to measure various aspects of synthetic data quality.
We focus on two types of metrics: fidelity metrics, which aim to evaluate how realistic synthetic data is, and diversity metrics, which aim to evaluate how diverse the synthetic data is compared to the real data. We check how well these metrics actually capture the qualities they aim to measure using very simple scenarios in which it is obvious how a good metric should behave. We find that every currently existing metric fails to behave as it should on many of these checks, and we conclude that all of them are flawed.
The flaws of current metrics mean that synthetic data evaluations relying on them may draw misleading conclusions about synthetic data quality. Results should therefore be interpreted with the known flaws of the metrics in mind, and better metrics should be developed that fix as many of these flaws as possible.
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: M2Y1Z
Link To Code: https://github.com/vanderschaarlab/position-fidelity-diversity-metrics-flawed
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Generative model evaluation, synthetic data
Submission Number: 38