Abstract: Decoding non-invasive cognitive signals into natural language has long been a goal in building practical brain-computer interfaces (BCIs). Recent milestones have successfully decoded cognitive signals such as functional Magnetic Resonance Imaging (fMRI) and electroencephalogram (EEG) recordings into text in an open-vocabulary setting. However, how to split these datasets into training, validation, and test sets remains controversial, and the data contamination observed in prior research persists. In this study, we conduct a comprehensive analysis of current dataset-splitting strategies and find that data contamination significantly overstates model performance. Specifically, we first show that leakage of test subjects' cognitive signals corrupts the training of a robust encoder. Second, we show that leakage of text stimuli causes the auto-regressive decoder to memorize information seen in the test set. To eliminate the influence of data contamination and fairly evaluate models' generalization ability, we propose a new splitting method for different types of cognitive datasets (e.g., fMRI, EEG). We also evaluate state-of-the-art (SOTA) brain-to-text decoding models under the proposed dataset-splitting paradigm, providing baselines for further research.
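The submission does not spell out the splitting procedure itself, but a minimal sketch of a contamination-free split along the two leakage axes the abstract identifies (held-out subjects and held-out text stimuli) might look like the following; the record fields `subject`, `stimulus`, and `signal` are hypothetical placeholders, not the authors' actual data schema:

```python
import random


def contamination_free_split(samples, test_ratio=0.2, seed=0):
    """Split (subject, stimulus, signal) records so that neither the
    test subjects' signals nor the test text stimuli appear in training.

    `samples` is assumed to be a list of dicts with keys
    'subject', 'stimulus', and 'signal' (hypothetical field names).
    """
    rng = random.Random(seed)
    subjects = sorted({s["subject"] for s in samples})
    stimuli = sorted({s["stimulus"] for s in samples})
    rng.shuffle(subjects)
    rng.shuffle(stimuli)

    # Hold out a fraction of subjects and of stimuli for testing.
    test_subjects = set(subjects[: max(1, int(len(subjects) * test_ratio))])
    test_stimuli = set(stimuli[: max(1, int(len(stimuli) * test_ratio))])

    train, test = [], []
    for s in samples:
        subj_held_out = s["subject"] in test_subjects
        stim_held_out = s["stimulus"] in test_stimuli
        if subj_held_out and stim_held_out:
            # Unseen subject AND unseen stimulus: safe test sample.
            test.append(s)
        elif not subj_held_out and not stim_held_out:
            train.append(s)
        # Mixed samples (seen subject with held-out stimulus, or vice
        # versa) are dropped to avoid partial leakage.
    return train, test
```

Dropping the mixed cells of the subject-by-stimulus grid sacrifices some data, but it is the simplest way to guarantee that neither an encoder trained on a test subject's signals nor a decoder that has memorized a test sentence can inflate the reported scores.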
Paper Type: long
Research Area: Generation
Contribution Types: Reproduction study
Languages Studied: English