A Multi-Institutional Multimodal EEG Benchmark for Foundation Model Generalization and Early Neurological Diagnosis
Keywords: Clinical EEG, Foundation models, EEG dataset, Regional diversity, Alzheimer’s risk prediction, Representation learning, Self-supervised learning, Multimodal learning, Blood-based biomarkers
TL;DR: We present VEEG, the largest non-US clinical EEG dataset with multimodal extensions, providing a diverse pretraining resource that improves foundation model generalization and early Alzheimer’s risk prediction.
Abstract: Recent advances in deep learning have accelerated the development of foundation models (FMs) for electroencephalography (EEG), with significant efforts devoted to assembling EEG datasets and training large-scale models. However, existing EEG datasets remain highly fragmented and non-standardized, with limited regional diversity since most originate from the United States. Similarly, current EEG foundation models are trained on different datasets without consistent protocols, making it difficult to compare architectures fairly. Moreover, all existing models are trained exclusively on unimodal EEG signals, limiting their clinical utility, as many downstream diagnostic tasks, such as detecting neurodegenerative diseases, require integration of additional modalities beyond EEG.
To address these limitations, we introduce VEEG, a multimodal EEG dataset comprising over $6000$ patients collected from two major hospitals outside the US. In parallel, we unify all existing public EEG datasets into a single standardized corpus, enabling the first rigorous benchmarking of state-of-the-art EEG foundation model architectures under consistent pretraining and fine-tuning pipelines. Finally, using our multimodal EEG dataset, we design and evaluate a multimodal diagnostic model, demonstrating that integrating auxiliary modalities (e.g., blood biomarkers and clinical notes) with EEG substantially improves downstream prediction accuracy; for instance, it achieves a 27.64\% gain in Alzheimer’s disease risk prediction.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21953