Keywords: EEG, Contrastive learning, Dynamic channel clustering, Semantic alignment
TL;DR: EVA is the first unified framework to decode brain signals evoked by images, videos, and 3D objects, achieving state-of-the-art performance through novel frequency-domain processing and adaptive channel clustering.
Abstract: Decoding semantic information from electroencephalography (EEG) signals elicited by diverse visual stimuli remains a critical challenge in brain-computer interfaces and cognitive neuroscience. Existing approaches typically align EEG with single-modality visual stimuli but struggle to generalize across multiple modalities and temporal scales. We propose EVA (EEG-Vision Alignment), the first framework that unifies multi-scale EEG alignment with heterogeneous visual stimuli, including rapid image presentations, continuous video sequences, and 3D object rotations, within a single contrastive learning-based architecture. EVA’s Universal EEG Encoder features two key innovations: (1) a Frequency-Aware Dynamic Encoding (FADE) module that transforms EEG signals into the frequency domain via real-valued fast Fourier transform, enabling compact, adaptive representations through adjustable band-pass filtering; and (2) an Adaptive Channel Clustering (ACC) module that dynamically updates channel groupings using cross-attention and gradient-based optimization, capturing inter-channel synergies while mitigating noise. By optimizing EEG features to achieve both discriminative power for robust classification and semantic fidelity for high-quality reconstruction from brain signals, our framework achieves state-of-the-art performance across diverse tasks, including image retrieval, video classification, and 3D object recognition, on multiple datasets. Notably, our zero-shot reconstruction of 200 object categories from the THINGS-EEG dataset, using only aligned EEG features without textual or low-level cues, surpasses prior state-of-the-art by a significant margin. These results underscore EVA’s capability to extract robust, generalizable representations from EEG signals, demonstrating the superiority of our unified framework. Code will be released upon publication.
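The abstract's FADE module maps EEG into the frequency domain via a real-valued FFT and keeps a compact, band-limited representation. A minimal sketch of that idea is below; the function name `fade_encode`, the sampling rate, and the fixed band limits are illustrative assumptions (the paper describes the filter as adjustable/learned), not the authors' implementation.

```python
import numpy as np

def fade_encode(eeg, fs=250.0, band=(4.0, 40.0)):
    """Frequency-domain encoding sketch (hypothetical): rFFT per channel,
    then a band-pass mask keeps only the selected frequency bins.
    `fs` and the `band` limits are illustrative choices, standing in for
    the adjustable band-pass filtering described in the abstract.

    eeg: array of shape (channels, timesteps)
    returns: complex spectrum restricted to the pass band
    """
    spec = np.fft.rfft(eeg, axis=-1)                    # (C, T//2 + 1) complex bins
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / fs)  # bin frequencies in Hz
    mask = (freqs >= band[0]) & (freqs <= band[1])      # fixed stand-in for a learned filter
    return spec[:, mask]

# Toy usage: 4 channels, 1 second at 250 Hz -> 37 retained bins (4-40 Hz)
x = np.random.randn(4, 250)
z = fade_encode(x)
print(z.shape)
```

Restricting to a pass band shrinks the representation from 126 spectral bins per channel to 37 here, which is the "compact, adaptive representation" the abstract alludes to.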
Supplementary Material: zip
Primary Area: applications to neuroscience & cognitive science
Submission Number: 12400