Time Series Machine Learning for Classifying Electroencephalograms

Published: 18 Mar 2026, Last Modified: 18 Mar 2026
Accepted by DMLR
License: CC BY-SA 4.0
Abstract: Electroencephalography (EEG) is a crucial tool across neuroscience domains, including medical diagnostics, psychological research, and brain-computer interfacing (BCI). Its popularity is due to its non-invasiveness, high temporal resolution, and cost-effectiveness. The task of EEG classification involves learning to predict class labels associated with EEG segments based on previously observed data. This task is fundamental yet complex, given the high dimensionality, variability, and subject-specific nuances inherent in EEG data. We systematically evaluate recent advances in general-purpose time series machine learning (TSML) approaches to EEG classification. We present an EEG classification archive of 30 benchmark datasets, spanning diverse applications from clinical diagnostics to cognitive and BCI tasks. Our empirical evaluation compares traditional EEG approaches, deep learning models, Riemannian geometry-based classifiers, and state-of-the-art time series machine learning algorithms on this new benchmark. We find that one algorithm, a meta-ensemble called HIVE-COTE v2.0 (HC2), consistently outperforms alternative classifiers.
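To make the out-of-the-box evaluation protocol concrete, the sketch below fits aeon's HIVECOTEV2 implementation of HIVE-COTE v2.0 on synthetic data shaped like a small EEG problem (segments x channels x time points). The shapes and the one-minute build contract are placeholders for illustration, not settings from the paper.

```python
import numpy as np
from aeon.classification.hybrid import HIVECOTEV2

# Synthetic stand-in for one archive problem: 40 training segments,
# 8 channels, 256 time points, binary labels.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 8, 256))
y_train = rng.integers(0, 2, size=40).astype(str)
X_test = rng.normal(size=(10, 8, 256))

# Contracted HC2: cap the total build time rather than tuning anything,
# matching an out-of-the-box usage style.
clf = HIVECOTEV2(time_limit_in_minutes=1, random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
```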
Certifications: Dataset Certification
Keywords: Time Series Machine Learning, EEG, EEG Classification, time series classification
Changes Since Last Submission: Thank you for the detailed Action Editor comments. We have revised the manuscript to address each requested point through additional experiments, expanded analysis, and clearer discussion, while keeping the paper focused on its central contribution as a reproducible EEG data and evaluation resource. We summarise the main revisions below.

1. Stronger comparison against EEG-specific baselines. We expanded the benchmark to include widely used EEG-specific deep learning baselines and now report their performance and runtime under the same protocol as the other methods. In particular, we added EEGNet and DeepConvNet, two standard EEG CNN baselines available in the maintained Braindecode implementation, supporting the benchmark's reproducibility aims (Section 3.2, Table 2, and the results tables; a usage sketch follows this list). The Action Editor also mentioned ShallowConvNet and transformer-based models such as MedFormer. We did not include ShallowConvNet because it is mainly a shallower, cheaper variant within the same family as DeepConvNet rather than a fundamentally different modelling approach. Since the revised benchmark already includes both EEGNet and DeepConvNet, we judged that ShallowConvNet would add limited methodological coverage relative to the extra experimental cost. For transformers, we explain why selecting and fairly configuring a single standard baseline across 30 heterogeneous EEG datasets is not straightforward. MedFormer, for example, is released with a dataset-specific preprocessing and training pipeline rather than as a drop-in baseline for arbitrary EEG archives. Applying it fairly across our benchmark would require substantial additional preprocessing and design choices, weakening the aim of maintaining a uniform and reproducible evaluation protocol. We therefore cite MedFormer as relevant recent work but do not evaluate it in this revision, instead positioning transformer-style EEG models as a natural next extension enabled by the released archive and evaluation code (Section 3.2 and Conclusions).

2. Runtime analysis. We added a systematic runtime analysis reporting both fit time and prediction time (a minimal timing sketch follows this list). The main results now include average fit and prediction times for each classifier alongside the predictive metrics (Table 3), and we provide dataset-level fit times to show variation across problems and practical scaling behaviour (Appendix Table 8). We also added a dedicated HC2 runtime subsection explaining the main cost drivers and outlining practical speed-ups, including parallelism, pruning, and contracting (Section 5.1).

3. Preprocessing and interpretation. We strengthened the methodological discussion in two ways. First, we clarify that the datasets are used as released by the original authors, with no additional preprocessing applied by us, and discuss how differences in preprocessing, epoching, channel selection, and subject or session structure can affect comparability and potential bias across datasets (Section 2). Second, we expanded the analysis of HC2 through a deconstruction of its constituent representations (Section 5). We also added targeted case studies where non-TSML baselines outperform HC2 by at least 10 percentage points, including confusion matrices and ROC analyses that help clarify when spatial or covariance structure is especially important (Tables 4-6 and Figure 5). We remain cautious about making strong neurophysiological claims across heterogeneous datasets.
4. Out-of-the-box comparison versus fine-tuning. We now discuss tuning explicitly in Section 4. We emphasise that the benchmark is intended to provide a fair and reproducible out-of-the-box comparison on fixed train/test splits. We also note that extensive tuning can be problematic on small training sets and difficult to standardise fairly across very different model families. For that reason, we do not introduce broad tuning regimes in this revision. Instead, we clarify that tuning, regularisation, augmentation, transfer learning, and related data-efficiency strategies are important directions for future work, and that the released archive is intended to support such studies.

5. More balanced positioning of HC2. We revised the framing of the results to avoid over-emphasising HC2. In the revised manuscript, HC2 is presented as the best overall performer on this benchmark, but also as a method with substantial computational cost. We place greater emphasis on trade-offs, highlighting MRHydra as a practical alternative with a better speed-accuracy balance, and showing explicitly that some datasets favour domain-specific baselines such as CSP-SVM, R-MDM, and DeepConvNet (a sketch of these pipelines follows this list). This makes it clearer that the paper does not argue for a single universally best EEG classifier, but instead provides a benchmark that reveals where different approaches are strong or weak.

6. References. We performed a full reference audit. We hope these revisions address the requested minor corrections.
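For reference, a minimal sketch of how the Braindecode baselines in item 1 can be instantiated. The channel, time-point, and class counts are placeholders, and constructor argument names differ across Braindecode releases, so this is an illustration rather than the benchmark's exact configuration.

```python
import torch
from braindecode.models import EEGNetv4

# Illustrative shapes only. Braindecode >= 0.8 uses n_chans / n_outputs /
# n_times; older releases use in_chans / n_classes / input_window_samples.
# DeepConvNet is available analogously as braindecode.models.Deep4Net.
n_chans, n_times, n_classes = 22, 500, 4
model = EEGNetv4(n_chans=n_chans, n_outputs=n_classes, n_times=n_times)

x = torch.randn(16, n_chans, n_times)  # (batch, channels, time points)
logits = model(x)                      # (batch, n_classes)
print(logits.shape)
```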
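The runtime figures in item 2 separate training cost from inference cost. Below is a minimal sketch of one way to record both for a single train/test split; `timed_fit_predict` is a hypothetical helper for illustration, not a function from the released code.

```python
import time

def timed_fit_predict(clf, X_train, y_train, X_test):
    """Return (fit seconds, predict seconds, predictions) for one split."""
    start = time.perf_counter()
    clf.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    y_pred = clf.predict(X_test)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, y_pred
```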
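The domain-specific baselines named in item 5 are standard pipelines. A minimal sketch, assuming MNE-Python and pyRiemann, with synthetic data standing in for real trials; the component choices (four CSP filters, an OAS covariance estimator, an RBF kernel) are illustrative defaults, not the paper's settings.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from mne.decoding import CSP
from pyriemann.estimation import Covariances
from pyriemann.classification import MDM

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8, 256))  # (trials, channels, time points)
y = rng.integers(0, 2, size=60)

# CSP-SVM: supervised spatial filtering, then a kernel SVM.
csp_svm = make_pipeline(CSP(n_components=4), SVC(kernel="rbf"))
csp_svm.fit(X, y)

# R-MDM: per-trial covariance matrices classified by Riemannian
# minimum distance to mean.
r_mdm = make_pipeline(Covariances(estimator="oas"), MDM())
r_mdm.fit(X, y)
```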
Code: https://github.com/aeon-toolkit/aeon-neuro
Assigned Action Editor: ~Fernando_Perez-Cruz1
Submission Number: 123