Better models of human high-level visual cortex emerge from natural language supervision with a large and diverse dataset
Abstract: High-performing neural networks for vision have dramatically advanced our ability to account for neural data in biological systems. Recently, further improvements in the performance of these networks have been catalysed by joint training on images and natural language, and by increases in dataset size and diversity. We explored whether the same factors (joint training, dataset size and data diversity) support similar improvements in the prediction of visual responses in the human brain. We used models pretrained with Contrastive Language-Image Pretraining (CLIP), which learns image embeddings that best match text embeddings of image captions from diverse, large-scale datasets, to study visual representations. We built voxelwise encoding models based on CLIP image features to predict brain responses to real-world images. We found that a CLIP-trained ResNet50 is a better model of high-level visual cortex, explaining up to 79% of the variance (R²) in voxel responses on held-out test data, a substantial increase over models trained only on image/label pairs (ImageNet-trained ResNet) or on text alone (BERT). Comparisons across different model backbones ruled out network architecture as the source of the performance improvement. Comparisons across models that controlled for dataset size and data diversity demonstrated that language feedback and large, diverse datasets are important factors in explaining neural responses in high-level visual brain regions. Visualizations of model embeddings and principal component analysis revealed that our models capture both global and fine-grained semantic dimensions represented within human visual cortex. Prediction of high-level visual representations in the human brain may therefore benefit from multimodal sources in network training and the incorporation of complex datasets.

Wang and colleagues show that language pretraining together with a large, diverse dataset builds better models of higher-level visual cortex than earlier models.
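The abstract describes the core pipeline (frozen CLIP image features feeding per-voxel linear encoding models, evaluated by R² on held-out images) without implementation detail. The following is a minimal sketch of that general approach, not the authors' code: it assumes the OpenAI `clip` package with the RN50 backbone and scikit-learn ridge regression, and the helpers `train_paths`, `test_paths` and `load_voxel_responses` are hypothetical stand-ins for the stimulus and fMRI data loading.

```python
# Sketch of a voxelwise encoding pipeline: frozen CLIP RN50 image
# embeddings -> one ridge regression per voxel -> R^2 on held-out images.
import numpy as np
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet50 CLIP backbone

def clip_features(image_paths):
    """Embed images with the frozen CLIP image encoder (n_images x 1024)."""
    feats = []
    with torch.no_grad():
        for path in image_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            feats.append(model.encode_image(img).float().cpu().numpy())
    return np.concatenate(feats)

# X: stimulus features; Y: voxel responses (n_images x n_voxels).
# train_paths, test_paths and load_voxel_responses are hypothetical.
X_train, X_test = clip_features(train_paths), clip_features(test_paths)
Y_train, Y_test = load_voxel_responses()

# One linear map per voxel; RidgeCV selects the regularization strength.
encoder = RidgeCV(alphas=np.logspace(-2, 5, 8))
encoder.fit(X_train, Y_train)

# Variance explained per voxel on held-out test images.
r2_per_voxel = r2_score(Y_test, encoder.predict(X_test),
                        multioutput="raw_values")
print("best voxel R^2:", r2_per_voxel.max())
```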
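The abstract also mentions principal component analysis of model embeddings as a way to surface the semantic dimensions the model captures. The exact procedure is not specified there; one plausible version, continuing from the sketch above, is to run PCA over the CLIP embeddings of the stimulus set and inspect the images at the extremes of each component.

```python
# Sketch: PCA over CLIP stimulus embeddings to inspect the dominant
# dimensions available to the encoding model (procedure assumed, not
# taken from the paper).
from sklearn.decomposition import PCA

pca = PCA(n_components=10)
pcs = pca.fit_transform(X_train)  # n_images x 10 component scores
print("variance ratio:", pca.explained_variance_ratio_.round(3))

# Images at the extremes of a component illustrate what it encodes.
order = pcs[:, 0].argsort()
print("PC1 low end:", [train_paths[i] for i in order[:3]])
print("PC1 high end:", [train_paths[i] for i in order[-3:]])
```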