Visio-Linguistic Brain Encoding

29 Sept 2021 (modified: 22 Oct 2023) · ICLR 2022 Conference Withdrawn Submission · Readers: Everyone
Keywords: fMRI encoding, Vision Transformers, Multi-Modal Transformers
Abstract: Enabling effective brain-computer interfaces requires understanding how the human brain encodes stimuli across modalities such as vision and language (text). Brain encoding aims to predict fMRI brain activity given a stimulus. A plethora of neural encoding models study brain encoding for single-modality stimuli: visual (pretrained CNNs) or textual (pretrained language models). A few recent papers have also obtained separate visual and text representation models and performed late fusion using simple heuristics. However, previous work has not explored: (a) the effectiveness of image Transformer models for encoding visual stimuli, and (b) co-attentive multi-modal modeling for visual and textual reasoning. Further, as pretrained image Transformers and multi-modal Transformers continue to evolve, it is important to understand whether they are becoming more brain-like and hence lead to improved brain encoding. In this paper, we systematically explore the efficacy of image Transformers (ViT, DEiT, and BEiT) and multi-modal Transformers (VisualBERT, LXMERT, ViLBERT, and CLIP) for brain encoding. Extensive experiments on two popular datasets, BOLD5000 and Pereira, provide the following insights. (1) To the best of our knowledge, we are the first to investigate the effectiveness of image and multi-modal Transformers for brain encoding. (2) Surprisingly, we observe a stronger correspondence between Transformer model layers and the levels of visual processing in the human brain than for CNN architectures. (3) Multi-modal Transformers significantly outperform previously proposed single-modality CNNs, image Transformers, and earlier multi-modal models, thereby establishing a new state-of-the-art. The superiority of visio-linguistic models raises the question of whether responses in visual regions are implicitly affected by linguistic processing even when images are viewed passively. Future fMRI studies can verify this computational insight in an appropriate experimental setting. We make our code publicly available.
Supplementary Material: zip
Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/arxiv:2204.08261/code)
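
The abstract frames brain encoding as predicting fMRI activity from a stimulus using features from pretrained models. The sketch below is a minimal illustration of that setup, not the authors' released code: it extracts CLIP image embeddings with Hugging Face Transformers and fits a ridge-regression encoder over voxels. The choice of feature layer, the placeholder inputs (`stimulus_paths`, `voxel_responses.npy`), the regularization grid, and the per-voxel Pearson-correlation evaluation are assumptions made for demonstration.

```python
# Illustrative brain-encoding sketch (assumptions, not the paper's pipeline):
# map pretrained CLIP image features to fMRI voxel responses with ridge regression.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_features(paths):
    """Extract pooled CLIP image embeddings for a list of image file paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats.numpy()

# Placeholder data (hypothetical): a list of stimulus images and a matrix of
# voxel responses with shape (n_stimuli, n_voxels) aligned to those stimuli.
X = image_features(stimulus_paths)
Y = np.load("voxel_responses.npy")

# Fit one linear encoder per voxel (RidgeCV handles multi-output targets).
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)
encoder = RidgeCV(alphas=np.logspace(-1, 4, 10)).fit(X_tr, Y_tr)

# Evaluate with per-voxel Pearson correlation between predicted and measured responses.
pred = encoder.predict(X_te)
r = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(Y_te.shape[1])]
print("mean voxel correlation:", float(np.mean(r)))
```

The same recipe applies to any of the feature extractors named in the abstract (ViT, DEiT, BEiT, VisualBERT, LXMERT, ViLBERT) by swapping the embedding function; comparing which features yield higher voxel correlations is the essence of the encoding comparison described above.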