Keywords: generative models, evaluation, robustness
Abstract: Reliable evaluation of visual generative models has been a long-standing problem. Existing evaluation metrics such as the Inception Score and FID follow the same methodology: they compute feature statistics of generated images using a backbone network pretrained on real-world images (e.g., ImageNet). However, recent work has found that these methods are often biased and inconsistent with human judgment. Moreover, we find that these metrics are highly sensitive to slight (even imperceptible) image perturbations. To develop a more robust metric aligned with humans, we explore a \emph{reversed} approach: pretrain a model on generated training data and evaluate it on natural test data. The underlying insight is that a lower test error on natural data indicates, in turn, that the training data are of higher quality. We show that this metric, which we call Virtual Classifier Error (VCE), aligns better with human evaluation than FID while being more robust to image noise. Conceptually, VCE suggests a new pragmatic perspective: measuring data quality by its usefulness for model training rather than by perceptual similarity.
Submission Number: 64
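To make the reversed-evaluation idea concrete, here is a minimal sketch of the procedure the abstract describes: train a classifier on labeled generated samples and report its error on a real, held-out test set. The function name, the linear stand-in classifier, and the toy random data below are illustrative assumptions, not the paper's actual VCE protocol, which may use a different classifier architecture and training setup.

# Sketch: train on generated data, evaluate on natural data.
# Lower error on real test data is read as a sign of more useful,
# and hence higher-quality, generated training data.
import numpy as np
from sklearn.linear_model import LogisticRegression


def virtual_classifier_error(gen_x, gen_y, real_x, real_y):
    """Train a stand-in classifier on generated data; return its error on real data."""
    # Flatten images into feature vectors for the simple linear classifier.
    gen_x = gen_x.reshape(len(gen_x), -1)
    real_x = real_x.reshape(len(real_x), -1)
    clf = LogisticRegression(max_iter=1000).fit(gen_x, gen_y)
    return 1.0 - clf.score(real_x, real_y)  # error rate on natural test data


# Toy usage with random placeholder arrays; replace with samples from a
# class-conditional generative model and a real labeled test set.
rng = np.random.default_rng(0)
gen_x, gen_y = rng.normal(size=(200, 8, 8)), rng.integers(0, 2, 200)
real_x, real_y = rng.normal(size=(100, 8, 8)), rng.integers(0, 2, 100)
print(f"VCE (toy): {virtual_classifier_error(gen_x, gen_y, real_x, real_y):.3f}")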