Vision and language representations in multimodal AI models and human social brain regions during natural movie viewing
Track: Proceedings Track
Keywords: multimodal transformers, audiovisual, fMRI, naturalistic stimuli, vision-language, neuroAI
Abstract: Recent work in NeuroAI suggests that representations in modern AI vision and language models are highly aligned both with each other and with human visual cortex. In addition, training AI vision models on language-aligned tasks (e.g., CLIP-style models) improves their match to visual cortex, particularly in regions involved in social perception, suggesting these brain regions may be similarly "language aligned". This prior work has primarily investigated static stimuli without language, but in daily life we experience the dynamic visual world and communicate about it with language simultaneously. To understand the processing of vision and language during natural viewing, we fit an encoding model to predict voxel-wise responses to an audiovisual movie using visual representations from both purely visual and language-aligned vision transformer models, as well as from paired language transformers. We first find that in naturalistic settings, the correlation between representations in vision and language models is remarkably low, yet both predict social perceptual and language regions well. Next, we find that language alignment does not improve a vision model embedding's match to neural responses in social perceptual regions, despite these regions being well predicted by both vision and language embeddings. Preliminary analyses, however, suggest that vision alignment does improve a language model's ability to match neural responses in language regions during audiovisual processing. Our work demonstrates the importance of testing multimodal AI models in naturalistic settings and reveals differences between language alignment in modern AI models and the human brain.
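For readers unfamiliar with voxel-wise encoding models, the sketch below illustrates the general approach described in the abstract: mapping model embeddings to fMRI responses and scoring held-out prediction accuracy per voxel. The specific choices here (cross-validated ridge regression, contiguous folds, Pearson-r scoring, and the placeholder inputs `features` and `bold`) are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch of a voxel-wise encoding model (assumed setup, not the paper's code).
# `features`: (n_TRs x n_dims) embeddings from a vision or language transformer,
# already downsampled/lagged to the fMRI sampling rate.
# `bold`: (n_TRs x n_voxels) preprocessed fMRI responses to the same movie.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def fit_encoding_model(features, bold, alphas=np.logspace(0, 4, 10), n_splits=5):
    """Return per-voxel prediction accuracy (Pearson r) under K-fold cross-validation."""
    n_trs, n_voxels = bold.shape
    scores = np.zeros(n_voxels)
    cv = KFold(n_splits=n_splits, shuffle=False)  # contiguous folds respect temporal structure
    for train_idx, test_idx in cv.split(features):
        # Fit a multi-output ridge regression from embeddings to all voxels at once.
        model = RidgeCV(alphas=alphas)
        model.fit(features[train_idx], bold[train_idx])
        pred = model.predict(features[test_idx])
        # Correlate predicted and observed time courses, voxel by voxel.
        pred_z = (pred - pred.mean(0)) / pred.std(0)
        bold_z = (bold[test_idx] - bold[test_idx].mean(0)) / bold[test_idx].std(0)
        scores += (pred_z * bold_z).mean(0)
    return scores / n_splits
```

Comparing such scores across feature sets (purely visual, language-aligned visual, and language embeddings) within social perceptual and language regions is the kind of analysis the abstract summarizes.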
Submission Number: 58