Keywords: representational alignment, Representational Similarity Analysis, RSA, benchmarking, neuro-AI, video AI, neuroscience, EEG
Abstract: The human brain is the most efficient and versatile system for processing dynamic visual input. By comparing representations from deep video models to brain activity, we can gain insights into mechanistic solutions for effective video processing, which matters both for understanding the brain and for building better models.
Current work in model-brain alignment primarily focuses on fMRI measurements, leaving open questions about fine-grained dynamic processing.
Here, we introduce the first large-scale benchmark of both static and temporally integrating deep neural networks on brain alignment to dynamic electroencephalography (EEG) recordings of short natural videos. We analyze 100+ models across the axes of temporal integration, classification task, architecture and pretraining using our proposed Cross-Temporal Representational Similarity Analysis (CT-RSA), which matches the best time-unfolded model features to dynamically evolving brain responses, distilling $10^7$ alignment scores. Our findings reveal new insights into how continuous visual input is integrated in the brain, beyond the standard temporal processing hierarchy from low- to high-level representations.
Responses in posterior electrodes, after initial alignment to hierarchical static object processing, best align to mid-level representations of temporally-integrative actions and closely match the unfolding video content.
In contrast, responses in frontal electrodes best align with high-level static action representations and show no temporal correspondence to the video.
Additionally, state space models show superior alignment to intermediate posterior activity through mid-level action features, for which self-supervised pretraining is also beneficial.
We liken the changing neural preference in task and temporal integration, reflected in the alignment to different model types across time, to a dynamic mixture of expert models.
We posit that a single best-aligned model would need task-independent training to combine these capacities as well as an architecture that supports dynamic switching.
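As a rough illustration of the cross-temporal matching idea behind CT-RSA, the sketch below builds a representational dissimilarity matrix (RDM) per model time step and per EEG time point, correlates every pair, and keeps the best-matching model time step for each EEG time point. This is a minimal sketch under stated assumptions, not the authors' implementation: the array layouts, the correlation-distance RDM, the Spearman comparison and the function names (`rdm`, `spearman`, `cross_temporal_rsa`) are all hypothetical choices made for illustration.

```python
import numpy as np

def rdm(features):
    """Condensed RDM (1 - Pearson r) for an (n_stimuli, n_features) array."""
    corr = np.corrcoef(features)
    upper = np.triu_indices(corr.shape[0], k=1)
    return 1.0 - corr[upper]

def spearman(a, b):
    """Spearman correlation via ranks (assumes no ties, as with continuous data)."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(rank_a, rank_b)[0, 1]

def cross_temporal_rsa(model_feats, eeg):
    """Compare every model time step with every EEG time point.

    Assumed (hypothetical) layouts:
      model_feats: (n_model_times, n_stimuli, n_features)
      eeg:         (n_eeg_times,   n_stimuli, n_channels)
    Returns the full score matrix, plus the best model time step and
    its score for each EEG time point.
    """
    model_rdms = [rdm(f) for f in model_feats]
    eeg_rdms = [rdm(e) for e in eeg]
    scores = np.array([[spearman(m, e) for e in eeg_rdms] for m in model_rdms])
    best_model_time = scores.argmax(axis=0)  # one index per EEG time point
    best_score = scores.max(axis=0)
    return scores, best_model_time, best_score
```

Calling `cross_temporal_rsa(model_feats, eeg)` on arrays with the assumed layouts returns a score matrix of shape `(n_model_times, n_eeg_times)`. A diagonal pattern in `best_model_time` would indicate that the brain response tracks the unfolding video content, while a flat pattern would indicate no temporal correspondence.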
Primary Area: applications to neuroscience & cognitive science
Submission Number: 20419