Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in Brain

ICLR 2026 Conference Submission12839 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: brain encoding, fMRI, multimodal instruction-tuned models, Video LLMs, Audio LLMs, multi-modal stimuli, Transformers, interpretability, pretrained multimodal video-audio LLMs

Abstract: Recent voxel-wise multimodal brain encoding studies have shown that multimodal Transformer models exhibit higher brain alignment than unimodal models in two distinct settings: when subjects are presented with unimodal stimuli and when they are exposed to multimodal stimuli. Notably, this alignment is achieved even though these Transformer models are not trained on brain data. More recently, a new class of models, namely instruction-tuned multimodal models, has emerged, demonstrating strong zero-shot performance across a variety of tasks. These models offer a promising direction for capturing task-specific representations that align closely with brain activity. However, prior work evaluating the brain alignment of multimodal large language models (MLLMs) has primarily focused on unimodal settings or relied on non-instruction-tuned multimodal models for multimodal stimuli. To address this gap, we investigate brain alignment, i.e., the degree to which neural activity can be predicted from instruction-specific embeddings, using six video and two audio MLLMs as participants watch naturalistic movies (video with audio). Experiments with 13 video task-specific instructions show that instruction-tuned video MLLMs significantly outperform in-context learning multimodal models (by 9%), non-instruction-tuned multimodal models (by 15%), and unimodal models (by 20%). Specifically, our evaluation of MLLMs on both video and audio tasks using language-guided instructions shows clear disentanglement of task-specific representations from MLLMs, leading to precise differentiation of multimodal functional processing in the brain. We also find that MLLM layers align hierarchically with the brain: early sensory areas align strongly with early layers, while higher-level visual and language regions align more with middle-to-late layers. These findings provide clear evidence for the role of task-specific instructions in enhancing the alignment between brain activity and MLLMs, and open new avenues for mapping joint information processing in both systems.
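
For readers unfamiliar with voxel-wise encoding, the idea of "predictivity of neural activity" from model embeddings can be illustrated with a minimal sketch. The abstract does not specify the regression model or the scoring metric, so the closed-form ridge regression and per-voxel Pearson correlation below are assumptions, as are all function names, array shapes, and the synthetic data standing in for MLLM embeddings and fMRI responses.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y.

    X: (n_samples, n_features) stimulus embeddings (hypothetical stand-in
       for instruction-specific MLLM embeddings).
    Y: (n_samples, n_voxels) responses, one column per voxel.
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against constant columns, which would give a zero denominator.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


# Synthetic data only: responses are a noisy linear function of the features.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 50, 16, 5
W_true = rng.normal(size=(n_features, n_voxels))
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_voxels))

# Fit on training samples, then score each voxel on held-out samples.
W = fit_ridge(X_train, Y_train, alpha=1.0)
scores = voxelwise_pearson(Y_test, X_test @ W)
```

Under this setup, `scores` holds one held-out correlation per voxel. Comparing such scores across embeddings from different models, instructions, or layers is one way the comparisons described in the abstract could be framed.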

Supplementary Material: zip

Primary Area: applications to neuroscience & cognitive science

Submission Number: 12839
