Deep Neural Networks Can Learn Generalizable Same-Different Visual Relations

TMLR Paper 2778 Authors

30 May 2024 (modified: 20 Sept 2024) · Rejected by TMLR · CC BY 4.0
Abstract: Although deep neural networks can achieve human-level performance on many object recognition benchmarks, prior work suggests that these same models fail to learn simple abstract relations, such as determining whether two objects are the same or different. Much of this prior work focuses on training convolutional neural networks to classify images containing two same or two different abstract shapes, and tests generalization only on within-distribution stimuli. In this article, we comprehensively study whether deep neural networks can acquire and generalize same-different relations both within- and out-of-distribution, using a variety of architectures, forms of pretraining, and fine-tuning datasets. We find that certain pretrained transformers can learn a same-different relation that generalizes with near-perfect accuracy to out-of-distribution stimuli. Furthermore, we find that fine-tuning on abstract shapes that lack texture or color provides the strongest out-of-distribution generalization. Our results suggest that, with the right approach, deep neural networks can learn generalizable same-different visual relations.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Added additional evaluation sets from Puebla and Bowers (2022) to Appendix A.1
- Added more detail in the introduction and discussion about the nature of CLIP ViT's same-different relation; in particular, we argue that the model learns a "fuzzy" same-different relation that uses an object-embedding similarity threshold to compute equality (rather than an exact pixel-by-pixel comparison between objects)
- Added multiple new investigations supporting the "fuzzy" same-different claim to Appendix A.1; these investigations also help explain CLIP ViT's poor generalization on some of the evaluation sets from Puebla and Bowers (2022)
- Added a further exploration of CLIP ViT's reflection invariance to Appendix A.1 (which also helps to explain model performance on some evaluation sets from Puebla and Bowers (2022))
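To illustrate the "fuzzy" same-different idea described above, the following is a minimal sketch (not the authors' code) of how equality could be computed by thresholding the similarity of two object embeddings rather than by an exact pixel-by-pixel comparison. The function name, the threshold value, and the use of random stand-in embeddings are illustrative assumptions; in the paper's setting the embeddings would come from a pretrained image encoder such as CLIP ViT.

```python
import torch
import torch.nn.functional as F

def fuzzy_same_different(emb_a: torch.Tensor, emb_b: torch.Tensor,
                         threshold: float = 0.9) -> str:
    """Classify two object embeddings as 'same' or 'different' by
    thresholding their cosine similarity, rather than requiring an
    exact pixel-level match. Threshold value is a placeholder."""
    sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
    return "same" if sim >= threshold else "different"

# Random stand-in embeddings for illustration only; in practice these
# would be object embeddings produced by a pretrained encoder.
a = torch.randn(512)
b = a + 0.01 * torch.randn(512)   # slightly perturbed copy of the same object
c = torch.randn(512)              # an unrelated object
print(fuzzy_same_different(a, b))  # likely "same"
print(fuzzy_same_different(a, c))  # likely "different"
```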
Assigned Action Editor: ~Erin_Grant1
Submission Number: 2778