Keywords: language-alignment, visual representation, neural regression, multimodality, CLIP, SimCLR
TL;DR: We present two case studies that use controlled comparisons across unimodal and multimodal foundation models to assess the impact of language alignment on the human brain and behavior.
Abstract: One of the core algorithmic forces driving the development of modern foundation models is the use of contrastive language alignment to facilitate more robust visual representation learning. The clear benefits conferred by CLIP-style multimodal objective functions in computer vision have generated a frenzy of interest in the application of these models to a long-debated question in cognitive neuroscience: to what extent does language shape perceptual representation in the human mind? In this work, we explore this question in two distinct domains: the prediction of brain activity in the human ventral visual system (as measured by high-resolution fMRI), and the prediction of visually evoked affect in human image assessment (as measured by self-report). In both cases, we leverage popular open-source foundation models (e.g. OpenAI's CLIP) in conjunction with empirically controlled alternatives (e.g. Meta AI's SLIP models) to better isolate the effects of language alignment while holding architecture and dataset constant. These controlled experiments offer mixed evidence regarding the influence of language on perceptual representation: specifically, when architecture and dataset are held constant, we find no evidence that language alignment improves the brain predictivity of vision models, but we do find strong evidence that it increases the predictivity of behavioral image assessments. We offer these examples as a case study in the urgency of injecting greater empirical control into the development and evaluation of foundation models, whose emergent properties may be attributable to a variety of sources that only systematic model comparison can fully disentangle.
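The controlled-comparison logic described above can be sketched as a standard encoding-model analysis: extract image embeddings from each candidate model (matched in architecture and training data, differing only in objective), then fit a cross-validated ridge regression from embeddings to fMRI voxel responses and compare predictivity. The sketch below uses random placeholder arrays and hypothetical variable names; it is not the paper's actual pipeline, only an illustration of the analysis pattern under those assumptions.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder stand-ins for real data: embeddings of the same images from
# two models sharing architecture and dataset but differing in objective
# (e.g. a SimCLR-style vision-only model vs. a CLIP-style language-aligned
# one), plus fMRI voxel responses to those images.
n_images, n_features, n_voxels = 200, 64, 10
feats_vision_only = rng.standard_normal((n_images, n_features))
feats_lang_aligned = rng.standard_normal((n_images, n_features))
voxels = rng.standard_normal((n_images, n_voxels))

def brain_predictivity(features, voxels):
    """Mean cross-validated R^2 of a ridge regression mapping model
    features to voxel responses (scored per voxel, then averaged)."""
    model = RidgeCV(alphas=np.logspace(-2, 4, 7))
    scores = [
        cross_val_score(model, features, voxels[:, v],
                        cv=5, scoring="r2").mean()
        for v in range(voxels.shape[1])
    ]
    return float(np.mean(scores))

score_vision = brain_predictivity(feats_vision_only, voxels)
score_lang = brain_predictivity(feats_lang_aligned, voxels)
print(f"vision-only: {score_vision:.3f}  language-aligned: {score_lang:.3f}")
```

Because architecture and training data are identical across the two models, any reliable gap between the two scores can be attributed to the language-alignment objective itself; with the random data above, both scores hover near (or below) zero, as expected when there is no true signal.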