Revealing Vision-Language Integration in the Brain with Multimodal Networks

22 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Primary Area: applications to neuroscience & cognitive science
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Vision and language in the brain, multimodal processing, encoding models
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We identify areas of vision-language integration in the brain using specific comparisons between multimodal and unimodal models.
Abstract: We use multimodal deep neural networks to identify sites of multimodal integration in the human brain. These are regions where a multimodal language-vision model predicts neural recordings (stereoelectroencephalography, SEEG) better than a unimodal language model, a unimodal vision model, or a linearly-integrated language-vision model. We use a wide range of state-of-the-art models spanning different architectures, including Transformers and CNNs (ALBEF, BLIP, Flava, ConvNeXt, BEIT, SIMCLR, CLIP, SLIP), with different multimodal integration approaches to model the SEEG signal recorded while subjects watched movies. As a key enabling step, we first demonstrate that the approach has the resolution to distinguish trained from randomly-initialized models for both language and vision; the inability to do so would fundamentally hinder further analysis. We show that trained models systematically outperform randomly initialized models in their ability to predict the SEEG signal. We then compare unimodal and multimodal models against one another. A key contribution is standardizing the methodology for doing so while carefully avoiding statistical artifacts. Since the models all have different architectures, numbers of parameters, and training sets, any of which can obscure the results, we also carry out a controlled comparison between two models, SLIP-Combo and SLIP-SimCLR, which keep all of these attributes identical aside from the multimodal input. Using this method, we identify neural sites (on average 141 out of 1090 total sites, or 12.94%) and brain regions where multimodal integration is occurring. We find numerous new sites of multimodal integration, many of which lie around the temporoparietal junction, long theorized to be a hub of multimodal integration.
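The comparisons described above rest on standard encoding models: a regularized linear map from a network's activations to the SEEG responses, scored by held-out correlation per electrode. The following is a minimal sketch of that pipeline under stated assumptions, not the paper's released code; the feature arrays, event alignment, ridge penalties, and the SLIP-Combo / SLIP-SimCLR variable names are illustrative placeholders.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold


def encoding_score(features, seeg, n_splits=5, alphas=np.logspace(-2, 4, 7)):
    """Held-out predictivity of one network's features for each SEEG electrode.

    features : (n_events, n_dims) activations from one layer of a network.
    seeg     : (n_events, n_electrodes) neural responses aligned to the same events.
    Returns a Pearson correlation per electrode between predicted and actual responses.
    """
    preds = np.zeros(seeg.shape)
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        # Regularized linear encoding model fit on the training folds only.
        model = RidgeCV(alphas=alphas).fit(features[train], seeg[train])
        preds[test] = model.predict(features[test])
    return np.array([pearsonr(preds[:, e], seeg[:, e])[0]
                     for e in range(seeg.shape[1])])


# Hypothetical usage: compare the architecture-matched pair named in the abstract.
# multimodal_r = encoding_score(slip_combo_features, seeg_events)
# unimodal_r   = encoding_score(slip_simclr_features, seeg_events)
# candidate_sites = np.where(multimodal_r > unimodal_r)[0]
```

A raw per-electrode difference in held-out correlation is only a starting point; the abstract's emphasis on avoiding statistical artifacts implies that candidate sites would additionally be vetted with significance testing rather than taken from the sign of the difference alone.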
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6104