Track: Extended Abstract Track
Keywords: NeuroAI, Foundation models, Multimodal encoding, Multimodal learning, Representation learning, Brain encoding models, Brain–machine alignment, Cross-modal alignment, Transformer architectures
Abstract: Foundation models enable image-to-brain encoders that scale across cortical regions and subjects. VISGate couples a frozen DINOv2 backbone with a lightweight ROI-Transformer and two outputs: (i) voxel-wise fMRI response prediction and (ii) per-ROI caption embedding prediction. Trained on the Natural Scenes Dataset (NSD), the model yields robust voxel predictivity, with systematic variation across five cortical streams (early, midventral, midlateral, ventral, lateral). We evaluate per-voxel correlation, split-half noise ceilings, and normalized accuracy, and we visualize semantic category-wise ROI profiles. Across multiple NSD subjects, ventral and lateral ROIs dominate normalized accuracy, while the caption head emphasizes early and lateral ROIs, suggesting both shared and distinct contributions of visual and linguistic components to brain responses to natural scenes.
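The abstract does not spell out implementation details, so the following is only a minimal PyTorch sketch of the two-headed design it describes, not the authors' code. All module names, dimensions, and the learned ROI-query mechanism are illustrative assumptions; the frozen DINOv2 backbone is assumed to supply precomputed patch tokens.

```python
import torch
import torch.nn as nn

class ROITransformerSketch(nn.Module):
    """Hypothetical sketch: ROI queries cross-attend to frozen DINOv2 patch
    tokens; one head predicts voxel responses, the other caption embeddings."""
    def __init__(self, feat_dim=768, d_model=256, n_rois=5,
                 n_voxels=10000, cap_dim=512, n_layers=2, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        # One learned query per stream (early, midventral, midlateral,
        # ventral, lateral) -- an assumed design, not stated in the abstract.
        self.roi_queries = nn.Parameter(torch.randn(n_rois, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        # (i) voxel-wise fMRI prediction; a single joint head for simplicity
        # (in practice each ROI's voxels might be read from its own token).
        self.voxel_head = nn.Linear(n_rois * d_model, n_voxels)
        # (ii) per-ROI caption embedding prediction.
        self.caption_head = nn.Linear(d_model, cap_dim)

    def forward(self, patch_feats):
        # patch_feats: (B, n_patches, feat_dim) from the frozen backbone
        mem = self.proj(patch_feats)
        q = self.roi_queries.unsqueeze(0).expand(mem.size(0), -1, -1)
        roi_tokens = self.decoder(q, mem)                # (B, n_rois, d_model)
        voxels = self.voxel_head(roi_tokens.flatten(1))  # (B, n_voxels)
        captions = self.caption_head(roi_tokens)         # (B, n_rois, cap_dim)
        return voxels, captions
```

The reported metrics likewise admit a standard reading: per-voxel Pearson correlation, a Spearman-Brown corrected split-half noise ceiling, and normalized accuracy as their ratio. The paper's exact conventions are not given here, so this is one common formulation:

```python
import numpy as np

def pervoxel_corr(pred, true):
    # Pearson r per voxel across stimuli; inputs: (n_stimuli, n_voxels)
    p = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
    t = (true - true.mean(0)) / (true.std(0) + 1e-8)
    return (p * t).mean(0)

def split_half_ceiling(rep1, rep2):
    # Spearman-Brown corrected split-half reliability from two repetitions
    r = pervoxel_corr(rep1, rep2)
    return 2 * r / (1 + r)

def normalized_accuracy(pred, true, ceiling, eps=1e-8):
    # Model correlation expressed as a fraction of the noise ceiling
    return pervoxel_corr(pred, true) / np.clip(ceiling, eps, None)
```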
Submission Number: 88