Keywords: Human vision, spatiotemporal representation learning, representation alignment
TL;DR: Training DNNs to learn the causal structure of spatiotemporal image data produces visual representations that are aligned with human decisions.
Abstract: The many successes of deep neural networks (DNNs) over the past decade have largely been driven by computational scale rather than insights from biological intelligence. While DNNs have nevertheless been surprisingly adept at explaining behavioral and neural recordings from humans, a growing number of reports indicate that DNNs are becoming progressively worse models of human vision as they improve on standard computer vision benchmarks. Here, we provide evidence that one path towards improving the alignment of DNNs with human vision is to train them with data and objective functions that more closely resemble those relied on by brains. We find that DNNs trained to capture the causal structure of large spatiotemporal object datasets learn generalizable object representations that exhibit smooth equivariance to three-dimensional (out-of-plane) variations in object pose and are predictive of human decisions and reaction times on popular psychophysics stimuli. Our work identifies novel data diets and objective functions that better align DNN vision with humans and can be easily scaled to generate the next generation of DNNs that behave as humans do.
Track: Extended Abstract Track
Submission Number: 69