Keywords: metric, classification boundary, neural networks
Abstract: We evaluate how both human volunteers and various neural network models classify a set of GAN-generated images that reflect the transition from one MNIST class to another. We find that models that obtain the same test accuracy on the standard MNIST test set exhibit different behavior on these images. Further, we find that although the number of misclassified images decreases as test accuracy increases, the spread in predictions over multiple runs on images that humans find difficult to classify also decreases. Our results raise the question of how we want networks to behave on images that could plausibly belong to multiple classes, and they hint at the value of complementing test accuracy with other evaluation metrics.
Category: Criticism of default practices: I would like to question some well-spread practice in the community