CerebroVoice: A Stereotactic EEG Dataset and Benchmark for Bilingual Brain-to-Speech Synthesis and Activity Detection

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Brain-to-speech Synthesis, Voice Activity Detection, Stereotactic Electroencephalography, Bilingual and Tonal Speech, Brain-Computer Interface
TL;DR: We present CerebroVoice, the first public sEEG dataset for bilingual brain-to-speech synthesis and voice activity detection. Our MoBSE model shows significant performance improvements, providing insights for brain-computer interfaces.
Abstract:

Speech synthesis from brain signals offers a new mode of communication, enabling innovative services and applications. With high temporal and spatial resolution, invasive brain sensing such as stereotactic electroencephalography (sEEG) is a promising approach to decoding complex brain dynamics. However, such data are hard to come by. In this paper, we introduce CerebroVoice, the first publicly accessible sEEG dataset curated for bilingual brain-to-speech synthesis. The CerebroVoice dataset comprises sEEG signals recorded while speakers read Mandarin Chinese words, English words, and Mandarin Chinese digits. We establish benchmarks for two tasks on the CerebroVoice dataset: speech synthesis and voice activity detection (VAD). The speech synthesis task aims to reconstruct the speech uttered by the participants from their sEEG recordings. We propose a novel framework, Mixture of Bilingual Synergy Experts (MoBSE), which uses language-aware dynamic organization of low-rank expert weights to improve the efficiency of language-specific decoding. MoBSE achieves significant performance improvements over current state-of-the-art methods, producing more natural and intelligible reconstructed speech. The VAD task aims to determine whether the speaker is actively speaking. For this benchmark, we adopt three established architectures and provide comprehensive evaluation metrics to assess their performance. Our findings indicate that low-frequency signals consistently outperform high-gamma activity across all metrics, suggesting that low-frequency filtering is more effective for VAD. These findings provide valuable insights for advancing brain-computer interfaces in clinical applications. The CerebroVoice dataset and benchmarks are publicly available on Zenodo and GitHub for research purposes.
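To make the MoBSE idea concrete, below is a minimal, hypothetical PyTorch sketch of a language-aware mixture of low-rank experts. The abstract only states that expert weights are low-rank and organized dynamically per language; the class names, the embedding-based gate, and all dimensions here are our assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a language-aware mixture of low-rank experts,
# loosely following the MoBSE description in the abstract. Names,
# dimensions, and the gating scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn


class LowRankExpert(nn.Module):
    """One expert: a low-rank linear map W ~ up(down(x)), rank r << d."""

    def __init__(self, d_in: int, d_out: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(d_in, rank, bias=False)  # project d_in -> r
        self.up = nn.Linear(rank, d_out, bias=False)   # project r -> d_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class MoLowRankExperts(nn.Module):
    """Mixes low-rank experts with weights conditioned on a language ID."""

    def __init__(self, d_in: int, d_out: int, n_experts: int = 4,
                 n_languages: int = 2, rank: int = 8):
        super().__init__()
        self.experts = nn.ModuleList(
            LowRankExpert(d_in, d_out, rank) for _ in range(n_experts)
        )
        # Language-aware gate: one learned softmax over experts per language.
        self.gate = nn.Embedding(n_languages, n_experts)

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_in); lang_id: (batch,), e.g. 0 = Mandarin, 1 = English.
        weights = torch.softmax(self.gate(lang_id), dim=-1)      # (batch, E)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, E, d_out)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)         # (batch, d_out)


# Usage: mix per-timestep sEEG features before a downstream speech decoder.
layer = MoLowRankExperts(d_in=128, d_out=128)
feats = torch.randn(16, 128)
langs = torch.randint(0, 2, (16,))
mixed = layer(feats, langs)  # (16, 128)
```

Similarly, the reported VAD finding hinges on the choice of frequency band. The sketch below shows one common way to extract the two competing feature bands with zero-phase Butterworth filters; the band edges (0.5-30 Hz vs. 70-170 Hz) and the 1 kHz sampling rate are illustrative assumptions, not dataset specifications.

```python
# Minimal sketch of the two band choices compared in the VAD benchmark:
# low-frequency content vs. high-gamma activity. Band edges and sampling
# rate are illustrative assumptions.
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 1000  # assumed sEEG sampling rate in Hz


def bandpass(x: np.ndarray, lo: float, hi: float, fs: float = FS) -> np.ndarray:
    """Zero-phase Butterworth band-pass along the last (time) axis."""
    sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)


seeg = np.random.randn(64, 10 * FS)       # (channels, samples) toy signal
low_freq = bandpass(seeg, 0.5, 30.0)      # low-frequency features for VAD
high_gamma = bandpass(seeg, 70.0, 170.0)  # high-gamma band
envelope = np.abs(high_gamma)             # crude amplitude envelope feature
```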
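Either band's output can then be windowed and fed to the benchmarked VAD classifiers; per the abstract's finding, the low-frequency features would be the stronger starting point.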

Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9343