Keywords: Brain-vision mapping, neural decoding, semantic selectivity
Abstract: Recent advances in vision-language models, such as CLIP, have enabled their widespread use in brain encoding and decoding, where global image embeddings serve as anchors linking visual stimuli to voxel-level brain responses. However, we observe that CLIP's global visual embeddings often exhibit hallucinatory semantics: they encode objects that are not explicitly present in an image but are inferred from prior associations. This imaginative bias poses a significant challenge for brain-vision mapping, particularly for natural scenes containing multiple annotated objects, where human neural responses are constrained to what is actually perceived. To address this issue, we propose a framework that suppresses CLIP's visual hallucination by integrating object- and concept-level representations. First, we extract object-centric embeddings using segmentation masks, isolating visual features tied to explicitly present objects. Next, we stabilize these diverse segment embeddings with a concept bank of text-derived CLIP embeddings, aligning bottom-up perception with top-down categorical knowledge through cross-attention. The resulting concept-stabilized object features act as corrective signals that are fused with global scene embeddings to form de-hallucinated visual representations. Finally, these representations are used for voxel-wise regression. Experiments on the NSD dataset demonstrate that our method produces representations that align more closely with category-selective brain regions (bodies, faces, food, places, and words), leading to more accurate and reliable brain-based image generation than standard CLIP regression. These results highlight the importance of suppressing model imagination when bridging human perception with multimodal foundation models and offer a new direction for robust, biologically grounded brain-vision alignment.
Primary Area: applications to neuroscience & cognitive science
Submission Number: 25
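The abstract describes a pipeline in which segment-level CLIP embeddings attend over a text-derived concept bank, the stabilized object features are fused with the global scene embedding, and the fused representation is regressed onto voxel responses. Below is a minimal sketch of that flow, assuming a PyTorch implementation with a 512-dimensional CLIP space; the class name, the voxel count, and all variable names are illustrative placeholders and not the authors' code.

```python
# Sketch (assumed PyTorch, not the authors' implementation) of the described pipeline:
# mask-derived segment embeddings attend to a concept bank of CLIP text embeddings,
# are pooled, fused with the global image embedding, and read out voxel-wise.
import torch
import torch.nn as nn

class ConceptStabilizedEncoder(nn.Module):
    def __init__(self, dim=512, n_voxels=10000, n_heads=8):  # n_voxels is a placeholder
        super().__init__()
        # Cross-attention: segment embeddings (queries) attend to the concept bank (keys/values).
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Fusion of the pooled, concept-stabilized object feature with the global CLIP embedding.
        self.fuse = nn.Linear(2 * dim, dim)
        # Voxel-wise linear readout (a closed-form ridge regression could be used instead).
        self.readout = nn.Linear(dim, n_voxels)

    def forward(self, global_emb, segment_embs, concept_bank):
        # global_emb:   (B, D)    CLIP image embedding of the full scene
        # segment_embs: (B, S, D) CLIP embeddings of mask-cropped objects
        # concept_bank: (C, D)    CLIP text embeddings of category names
        bank = concept_bank.unsqueeze(0).expand(segment_embs.size(0), -1, -1)
        stabilized, _ = self.cross_attn(segment_embs, bank, bank)  # (B, S, D)
        object_feat = stabilized.mean(dim=1)                       # pool over segments
        fused = self.fuse(torch.cat([global_emb, object_feat], dim=-1))
        return self.readout(fused)                                 # predicted voxel responses
```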