Do Masked Autoencoders Learn a Human-Like Geometry of Neural Representation? Divergence and Convergence Across Brains and Machines During Naturalistic Vision
Reviewer: ~Hamed_Karimi1
Presenter: ~Hamed_Karimi1
TL;DR: Masked Autoencoders yield visual representations that diverge from human neural responses; video MAEs, which incorporate temporal information, align more closely than image MAEs, but optic flow-based convolutional networks outperform both.
Abstract: Visual representations in the human brain are shaped by the pressure to support planning and interactions with the environment. Do visual representations in deep network models converge with visual representations in humans? Here, we investigate this question for a new class of effective self-supervised models: Masked Autoencoders (MAEs). We compare image MAEs and video MAEs to neural responses in humans as well as to convolutional neural networks. The results reveal that representations learned by MAEs diverge from neural representations in humans and in convolutional neural networks. Fine-tuning MAEs on a supervised task improves their correspondence with neural responses but is not sufficient to bridge the gap that separates them from supervised convolutional networks. Finally, video MAEs show closer correspondence to neural representations than image MAEs, revealing an important role of temporal information. However, convolutional networks based on optic flow correspond more closely to human neural responses than even video MAEs, indicating that although masked autoencoding yields visual representations that are effective at multiple downstream tasks, it is not sufficient to learn representations that converge with human vision.
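Note: the abstract does not specify how model representations were compared to neural responses. A common choice for this kind of brain-model comparison is representational similarity analysis (RSA); the sketch below, in Python, assumes RSA with correlation-distance representational dissimilarity matrices (RDMs) compared via Spearman correlation. All names (rdm, rsa_score, model_features, neural_responses) are illustrative and not the authors' code.

    # Hypothetical RSA sketch: correlate the stimulus-by-stimulus
    # dissimilarity structure of model features with that of neural data.
    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.stats import spearmanr

    def rdm(responses: np.ndarray) -> np.ndarray:
        """Condensed RDM: 1 - Pearson correlation between responses
        to each pair of stimuli. `responses` is (n_stimuli, n_features)."""
        return pdist(responses, metric="correlation")

    def rsa_score(model_features: np.ndarray, neural_responses: np.ndarray) -> float:
        """Spearman correlation between model and neural RDMs."""
        rho, _ = spearmanr(rdm(model_features), rdm(neural_responses))
        return rho

    # Random data standing in for real embeddings and recordings.
    rng = np.random.default_rng(0)
    model_features = rng.standard_normal((50, 512))    # e.g., MAE embeddings per stimulus
    neural_responses = rng.standard_normal((50, 200))  # e.g., voxel responses per stimulus
    print(f"RSA score: {rsa_score(model_features, neural_responses):.3f}")

A higher RSA score for one model (e.g., an optic flow-based convolutional network) than another (e.g., a video MAE) would indicate closer correspondence with the neural representational geometry.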
Length: long paper (up to 8 pages)
Domain: methods
Format Check: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Author List Check: The author list is correctly ordered and I understand that additions and removals will not be allowed after the abstract submission deadline.
Anonymization Check: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and URLs that point to identifying information.
Submission Number: 44