A large-scale fMRI dataset for vision-language semantic association

Shurui Li, Zhenyu Jin, Shi Gu, Ru-Yuan Zhang, Yuanning Li

Published: 17 Apr 2026, Last Modified: 26 Apr 2026 | Scientific Data | Everyone | CC BY-NC-ND 4.0

Abstract: Understanding the neural coding and association of visual and language information benefits from the development of deep learning models and the collection of massive datasets with extensive sampling of brain activity. Large-scale functional magnetic resonance imaging (fMRI) datasets with naturalistic stimuli provide more ecologically relevant experimental conditions and promote more reproducible research into the neural basis of sensory perception. Whereas most previous datasets are restricted to isolated modalities, here we present the Caption Scene Dataset (CSD), a large-scale fMRI dataset for vision-language semantic association, in which neural responses to 4,400 pairs of Chinese captions and naturalistic scenes were acquired from eight healthy participants. Participants were instructed to judge whether the caption and the image were semantically consistent. To illustrate the utility of CSD, we show that deep neural encoding models effectively predict neural responses to both caption and image stimuli across different cortical regions. This dataset provides a platform for investigating the neural basis of semantic association across vision and language, facilitating cross-disciplinary advances between vision neuroscience and artificial intelligence.