Track: Proceedings Track
Keywords: Action Recognition, Masked Autoencoders, Convolutional Neural Networks, Human Visual System
TL;DR: Neural responses to videos align better with models trained on dynamic information, with optic flow models capturing unique brain activity that Masked Autoencoders miss, particularly in early visual processing stages.
Abstract: We compared neural responses to naturalistic videos with representations in deep network models trained with static and dynamic information. Models trained with dynamic information showed greater correspondence with neural representations in all brain regions, including those previously associated with the processing of static information. Among the models trained with dynamic information, those based on optic flow accounted for unique variance in neural responses that was not captured by Masked Autoencoders. This effect was strongest in ventral and dorsal brain regions, indicating that despite the Masked Autoencoders' effectiveness at a variety of tasks, their representations diverge from representations in the human brain in the early stages of visual processing.
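The unique-variance claim in the abstract corresponds to a standard variance-partitioning analysis: the variance a feature set explains beyond what a competing feature set already accounts for. The sketch below illustrates the idea with nested linear regressions on synthetic data; all names (`flow_feats`, `mae_feats`, `voxel`) are illustrative placeholders, not the paper's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical stand-ins: feature matrices from two model families and
# one voxel's response vector (synthetic data for illustration only).
n_samples = 200
flow_feats = rng.normal(size=(n_samples, 10))  # e.g., optic-flow model features
mae_feats = rng.normal(size=(n_samples, 10))   # e.g., Masked Autoencoder features
voxel = flow_feats[:, 0] + 0.5 * mae_feats[:, 0] + rng.normal(scale=0.5, size=n_samples)

def r2(X, y):
    """In-sample R^2 of an ordinary least-squares fit."""
    return LinearRegression().fit(X, y).score(X, y)

# Unique variance of the flow features = R^2(both feature sets) - R^2(MAE only).
full = r2(np.hstack([flow_feats, mae_feats]), voxel)
mae_only = r2(mae_feats, voxel)
unique_flow = full - mae_only
print(f"unique variance explained by flow features: {unique_flow:.3f}")
```

In practice such analyses typically use cross-validated rather than in-sample R^2; the nested-model subtraction shown here is the core of the "unique variance" logic either way.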
Submission Number: 66