Abstract: Recent work at the intersection of psychology, neuroscience, and computer vision has advocated for more realistic visual tasks in modeling human vision. Deep neural networks have become leading models of the primate visual system, yet their behavior under identity-preserving 3D object transformations, such as translation, scaling, and rotation, has not been thoroughly compared to that of humans. Here, we evaluate both humans and image-based deep neural networks, including vision-only and vision-language models trained with supervised, self-supervised, or weakly supervised objectives, on their ability to recognize objects undergoing such transformations. Humans (n=220) and models (n=169) categorized images of 3D objects, generated with a custom pipeline, into 16 object categories recognizable by both. Human viewing time was limited to reduce reliance on recurrent processing. We find that both humans and models are robust to translation and scaling, but models struggle more with object rotation and are more sensitive to contextual changes. Humans and models agree on which in-depth object rotations are most challenging (when humans struggle, models do too), but humans are more robust overall and show more consistent category confusions with one another than with any model. By testing model families trained on different amounts of data and with different learning objectives, we show that data richness plays a substantial role in supporting robustness, potentially more so than vision-language alignment. Our benchmark excludes models trained on video, multiview, or 3D data, but it is in principle compatible with such models and may support their evaluation in future work. This study underscores the importance of naturalistic visual tasks for modeling human object perception in complex, real-world scenarios, and introduces ORBIT (Object Recognition Benchmark for Invariance to Transformations), a benchmark for evaluating and developing computational models of human object recognition. Code and data for ORBIT are available at: https://github.com/haideraltahan/ORBIT.
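For readers who want a concrete picture of the evaluation described above, the following is a minimal sketch of per-transformation scoring: classify each transformed stimulus into one of the 16 categories and tally accuracy separately for translation, scaling, and rotation. This is not the ORBIT codebase (see the repository linked above for that); the `per_transform_accuracy` function, the `(image, label, transform_name)` sample layout, and the `class_map` from the 16 benchmark categories to model output indices are all illustrative assumptions.

```python
# Hypothetical sketch of a transformation-robustness evaluation loop
# (not the actual ORBIT API).
from collections import defaultdict

import torch

@torch.no_grad()
def per_transform_accuracy(model, samples, class_map, device="cpu"):
    """Return accuracy per identity-preserving transformation.

    samples: iterable of (image [3,H,W] tensor, category index in 0..15,
             transform name such as "translation", "scaling", or "rotation").
    class_map: indices of the model outputs corresponding to the 16
             benchmark categories (an assumed mapping, for illustration).
    """
    model.eval().to(device)
    hits, counts = defaultdict(int), defaultdict(int)
    for image, label, tname in samples:
        logits = model(image.unsqueeze(0).to(device))[0]
        pred = logits[class_map].argmax().item()  # restrict to the 16 categories
        hits[tname] += int(pred == label)
        counts[tname] += 1
    return {t: hits[t] / counts[t] for t in counts}


# Illustrative usage with a dummy image; a real run would iterate the
# benchmark's rendered stimuli and a validated category-to-output mapping.
if __name__ == "__main__":
    from torchvision.models import resnet50, ResNet50_Weights

    model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
    class_map = list(range(16))  # placeholder mapping, for illustration only
    samples = [(torch.rand(3, 224, 224), 3, "rotation")]
    print(per_transform_accuracy(model, samples, class_map))
```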
Submission Number: 93