Abstract: The mechanisms used by the human visual system and artificial convolutional neural networks (CNNs) to understand images are vastly different. The two systems have different notions of hardness, meaning that the sets of images which appear ambiguous and hard to classify differ between them. In this paper, we answer the following question: are there measures we can compute from trained CNN models that correspond closely to human visual hardness? We employ human selection frequency, the frequency with which human annotators assign a given image to its ground-truth category, as a surrogate for human visual hardness. This information was recently made available for the ImageNet validation set~\cite{recht2019imagenet}. CNN model confidence does not correlate well with this human visual hardness score, which is unsurprising given the known calibration issues of these models. We propose a novel measure, angular visual hardness (AVH), defined as the normalized angular distance between the image feature embedding and the classifier weights of the target category. Through an in-depth study in which we test multiple hypotheses, we demonstrate that AVH is strongly correlated with human visual hardness across a broad range of CNN architectures. We observe that CNN models with the highest validation accuracy also have the best AVH scores, which agrees with the earlier finding that state-of-the-art (SOTA) models improve the classification of harder examples. We also observe that during the training of CNNs, AVH plateaus early even as the training loss keeps improving. We conjecture different causes for this plateau on easy versus hard examples, which suggests the need to design better loss functions that target harder examples more effectively and improve SOTA accuracy.
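Below is a minimal sketch of how AVH could be computed from the abstract's definition alone. The names (`features`, `class_weights`, `target`) are illustrative, and one plausible reading of "normalized angular distance" is the angle to the target-class weight divided by the sum of angles to all class weights; the exact normalization used in the paper may differ.

```python
# Hedged sketch: AVH as the angle between an image embedding and the target-class
# weight vector, normalized by the sum of angles to all class weight vectors.
import torch
import torch.nn.functional as F

def angular_visual_hardness(features, class_weights, target):
    """
    features:      (B, D) penultimate-layer embeddings for a batch of images
    class_weights: (C, D) weight vectors of the final linear classifier
    target:        (B,)   ground-truth class indices
    """
    # Cosine similarity between each embedding and every class weight: (B, C)
    cos_sim = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).T
    # Convert to angles in radians, clamping for numerical stability
    angles = torch.acos(cos_sim.clamp(-1 + 1e-7, 1 - 1e-7))
    # Angle to the target class, normalized by the sum of angles to all classes
    target_angle = angles.gather(1, target.unsqueeze(1)).squeeze(1)
    return target_angle / angles.sum(dim=1)
```

Under this reading, a larger AVH value means the embedding sits angularly farther from its target class relative to the other classes, i.e. a harder example.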