LibriBrain: Over 50 Hours of Within-Subject MEG to Improve Speech Decoding Methods at Scale

Miran Özdogan; Gilad Landau; Gereon Elvers; Dulhan Jayalath; Pratik Somaiya; Francesco Mantegna; Mark Woolrich; Oiwi Parker Jones

Published: 18 Sept 2025, Last Modified: 30 Oct 2025

Venue: NeurIPS 2025 Datasets and Benchmarks Track poster

License: CC BY-NC 4.0

Keywords: Data Sets or Data Repositories, Brain-Computer Interfaces and Neural Prostheses, Brain Imaging, Cognitive Science, Neuroscience

TL;DR: LibriBrain is the largest non-invasive MEG dataset (over 50 hours) recorded from a single subject listening to naturalistic speech, designed to advance scalable and reproducible machine learning methods for speech decoding from brain activity.

Abstract: LibriBrain is the largest single-subject MEG dataset to date for speech decoding, with over 50 hours of recordings: 5× larger than the next comparable dataset and 50× larger than most. This unprecedented 'depth' of within-subject data enables exploration of neural representations at a scale previously unavailable with non-invasive methods. LibriBrain comprises high-quality MEG recordings together with detailed annotations from a single participant listening to naturalistic spoken English, covering nearly the full Sherlock Holmes canon. Designed to support advances in neural decoding, LibriBrain comes with a Python library for streamlined integration with deep learning frameworks, standard data splits for reproducibility, and baseline results for three foundational decoding tasks: speech detection, phoneme classification, and word classification. Baseline experiments demonstrate that increasing training data yields substantial improvements in decoding performance, highlighting the value of scaling up deep, within-subject datasets. By releasing this dataset, we aim to empower the research community to advance speech decoding methodologies and accelerate the development of safe, effective clinical brain-computer interfaces.

Croissant File: json

Dataset URL: https://huggingface.co/datasets/pnpl/LibriBrain

Code URL: https://github.com/neural-processing-lab/libribrain-experiments

Supplementary Material: zip

Primary Area: Data and Benchmarking scenarios in Neuroscience and cognitive science (e.g., neural coding, brain-computer interfaces)

Flagged For Ethics Review: true

Submission Number: 997
