PROSPECT: Labeled Tandem Mass Spectrometry Dataset for Machine Learning in Proteomics

06 Jun 2022, 13:46 (modified: 11 Oct 2022, 19:49); NeurIPS 2022 Datasets and Benchmarks; Readers: Everyone
Keywords: Proteomics, Deep Learning, Machine Learning, Dataset, Mass Spectrometry, Retention Time, Annotated Spectra, Neutral Losses, ProteomeTools, Fragment Ions, Intensity
TL;DR: The paper introduces a labeled tandem mass spectrometry dataset for machine learning in proteomics and recommends evaluation metrics.
Abstract: Proteomics is the interdisciplinary field focusing on the large-scale study of proteins, which organize and execute essentially all functions within organisms. The most commonly used workflow today is bottom-up analysis, in which proteins are digested into peptides that are subsequently analyzed using tandem mass spectrometry (MS/MS). MS-based proteomics has transformed various fields in the life sciences, such as drug discovery and biomarker identification, and is now entering a phase where it is helpful for clinical decision-making. Computational methods are vital in turning large amounts of acquired raw MS data into information and, ultimately, knowledge. Deep learning has proved successful in multiple domains as a robust framework for supervised and unsupervised machine learning problems. In proteomics, scientists are increasingly leveraging deep learning to predict the properties of peptides from their sequence in order to improve confident peptide identification. However, a reference dataset that covers several proteomics tasks and enables performance comparison and the evaluation of reproducibility and generalization is missing. Here, we present a large labeled proteomics dataset spanning several tasks in the domain to address this challenge. We focus on two common applications: peptide retention time prediction and MS/MS spectrum prediction. We review existing methods and task formulations from a machine learning perspective and recommend suitable evaluation metrics and visualizations. With an accessible dataset, we aim to lower the entry barrier and enable faster development in machine learning for proteomics.
Supplementary Material: pdf
Dataset Url: Dataset on Zenodo: Repo for auxiliary code:
License: Dataset license: Creative Commons Attribution 4.0 International; Supplementary code license: MIT License
Author Statement: Yes
Contribution Process Agreement: Yes
In Person Attendance: Yes