Abstract: Perceptual distances between images, as measured in the space of pre-trained deep features, have outperformed prior low-level, pixel-based metrics at assessing image similarity. While the ability of older and less accurate models such as AlexNet and VGG to capture perceptual similarity is well known, modern and more accurate models are less studied. In this paper, we present a large-scale empirical study to assess how well ImageNet classifiers perform on perceptual similarity. First, we observe an inverse correlation between ImageNet accuracy and Perceptual Scores of modern networks such as ResNets, EfficientNets, and Vision Transformers: that is, better classifiers achieve worse Perceptual Scores. Then, we examine the relationship between ImageNet accuracy and Perceptual Score when varying the depth, width, number of training steps, weight decay, label smoothing, and dropout. Higher accuracy improves Perceptual Score up to a certain point, but we uncover a Pareto frontier between accuracy and Perceptual Score in the mid-to-high accuracy regime. We explore this relationship further using a number of plausible hypotheses such as distortion invariance, spatial frequency sensitivity, and alternative perceptual functions. Interestingly, we discover shallow ResNets and ResNets trained for fewer than 5 epochs, only on ImageNet, whose emergent Perceptual Score matches that of the prior best networks trained directly on supervised human perceptual judgements.
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

### Experiments:
* [7emx]: **Section 3.2:** Dynamic Range of PS - Added experiments on ground truth $p$ and simulated distances conditioned on $p$.
* [7emx, 9e1v]: **Section 8.4:** Experiments on residual connections and low entropy features.
* [9e1v]: **Appendix I:** EfficientNet Ablations
* [KGpd]: **Appendix J:** Rank Correlation using Distance Margins
* [KGpd]: **Appendix K:** Inverse-U on TID2013

### Writing
* [7emx]: Section 3.1, Added description on BAPPS Dataset.
* [7emx]: Figure 2, Images from BAPPS Dataset
* [7emx]: Moved Figures from previous Fig 3, Fig 4 and Fig 6 to below their corresponding paragraphs. They are now Figs 5 - Figs 9 and Figs 11 - Figs 17.
* [7emx]: Expanded captions, made legends and axis limits more consistent across all figures.
* [7emx]: Reference several sections from the Appendix directly in the main draft.
* [7emx]: Section 8.2 denotes what "farther" and "nearer" patch mean more concretely.
* [7emx]: Representations in Section 4, denote how we obtain representations from the baseline networks.
* [KGpd]: Introduction: Improved high-level motivation.
* [KGpd]: Broader Impact Statement
Assigned Action Editor: ~Jia-Bin_Huang1