Keywords: Brain-to-text reconstruction, Neural decoding, Semantic decoding, Multimodal learning
Abstract: The rapid progress of large vision-language models (VLMs), such as CLIP, has spurred the development of a wide range of neural decoding frameworks. Nevertheless, most existing approaches still suffer from semantic mismatches during representational alignment. This challenge may stem from the fact that the human brain does not distribute attention uniformly across a visual scene, but rather selectively encodes salient or relevant regions. Moreover, such selectivity is closely related to individual interests and varies from person to person. To address this challenge, we propose Yo'Mind, a novel optimal transport (OT)-driven personalized multimodal semantic masking framework designed to bridge the semantic gap between brains and machines in interpreting visual scenes. Technically, Yo'Mind introduces a dynamic semantic pruning and allocation mechanism that adaptively masks redundant visual semantic components in stimulus images based on individual neural responses, without requiring extra human supervision or hyperparameter tuning. This strategy enhances semantic consensus between brain and machine representations during decoding. Furthermore, the inherent flexibility of OT theory enables Yo'Mind to perform brain-visual-linguistic alignment and cross-subject decoding within a unified end-to-end architecture. Extensive experiments demonstrate that Yo'Mind offers several advantages, including state-of-the-art brain-to-text reconstruction performance and improved interpretability of the decoding process.
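The abstract does not specify how the OT plan drives the masking step. Below is a minimal, purely illustrative sketch of one way an entropic OT plan (Sinkhorn iterations) between brain-derived feature vectors and CLIP-style image patch embeddings could yield a per-patch relevance mask; the function names, `keep_ratio`, and all hyperparameters are assumptions, not the authors' method.

```python
# Illustrative sketch only: OT-based patch masking under assumed inputs
# (brain features summarized as vectors, stimulus image as patch embeddings).
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iters=200):
    """Entropic-regularized OT plan between two uniform marginals."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)          # source marginal (brain features)
    b = np.full(m, 1.0 / m)          # target marginal (image patches)
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iters):         # standard Sinkhorn updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan, shape (n, m)

def patch_mask(brain_feats, patch_feats, keep_ratio=0.5):
    """Keep the patches that receive the most transport mass from the brain features."""
    bn = brain_feats / np.linalg.norm(brain_feats, axis=1, keepdims=True)
    pn = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    cost = 1.0 - bn @ pn.T                            # cosine-distance cost matrix
    plan = sinkhorn_plan(cost)
    relevance = plan.sum(axis=0)                      # mass received by each patch
    k = max(1, int(keep_ratio * patch_feats.shape[0]))
    keep = np.argsort(relevance)[::-1][:k]            # top-k most relevant patches
    mask = np.zeros(patch_feats.shape[0], dtype=bool)
    mask[keep] = True
    return mask

# Toy usage: 32 brain feature vectors vs. 196 patch embeddings, 512-dim each.
rng = np.random.default_rng(0)
mask = patch_mask(rng.normal(size=(32, 512)), rng.normal(size=(196, 512)))
print(mask.sum(), "patches kept")
```

In such a scheme, patches with little transport mass would be treated as redundant and masked before brain-visual-linguistic alignment; whether Yo'Mind uses a hard top-k mask or a soft reweighting is not stated in the abstract.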
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 8461