Pep2Prob Benchmark: Predicting Fragment Ion Probability for MS$^2$-based Proteomics

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Fragment Ion Probability Prediction, Tandem Mass Spectrometry, MS/MS Mass Spectrum, Proteomics
TL;DR: Pep2Prob is the first dataset and benchmark for predicting peptide-specific fragment probabilities in tandem mass spectrometry, going beyond simplified global fragmentation statistics for peptide identification.
Abstract: Proteins perform nearly all cellular functions and constitute most drug targets, making their analysis fundamental to understanding human biology in health and disease. Tandem mass spectrometry (MS$^2$) is the major analytical technique in proteomics: it ionizes and fragments peptides, then uses the resulting mass spectra to identify and quantify proteins in biological samples. In MS$^2$ analysis, predicting peptide fragment ion probabilities plays a critical role, improving the accuracy of peptide identification from MS$^2$ spectra as a complement to intensity information. Current approaches rely on global fragmentation statistics, which assume that a fragment's probability is uniform across all peptides. This assumption is biochemically oversimplified and limits prediction accuracy. To address this gap, we present **Pep2Prob**, the first comprehensive dataset and benchmark designed for peptide-specific fragment ion probability prediction. The dataset contains fragment ion probability statistics for 608,780 unique precursors (each precursor is a pair of a peptide sequence and a charge state), summarized from more than 183 million high-quality, high-resolution HCD MS$^2$ spectra with validated peptide assignments and fragmentation annotations. We establish baseline performance using simple statistical rules and learning-based methods, and find that models leveraging peptide-specific information significantly outperform previous methods that use only global fragmentation statistics. Furthermore, performance across benchmark models of increasing capacity suggests that the peptide-fragmentation relationship exhibits complex nonlinearities requiring sophisticated machine learning approaches. Pep2Prob provides a standardized evaluation framework that will accelerate algorithmic innovation in computational proteomics while introducing a biologically significant prediction task to the machine learning community.
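The contrast the abstract draws between peptide-specific and global fragment statistics can be illustrated with a minimal sketch. It assumes, hypothetically, that a fragment ion's probability is estimated as the fraction of a precursor's annotated spectra in which that fragment is observed; the abstract does not spell out the estimator, and the function names, fragment labels, and toy spectra below are illustrative only, not taken from the Pep2Prob dataset or code.

```python
from collections import defaultdict

# Toy annotated spectra: (peptide sequence, charge state, observed fragment labels).
# These values are invented for illustration; they are not Pep2Prob data.
spectra = [
    ("PEPTIDE", 2, {"b2", "y3"}),
    ("PEPTIDE", 2, {"y3"}),
    ("PEPTIDE", 2, {"b2", "y3"}),
    ("PEPTIDE", 2, {"y3"}),
    ("ACDK", 2, {"b2"}),
    ("ACDK", 2, {"b2", "y3"}),
]


def precursor_fragment_probabilities(spectra):
    """Assumed estimator: per precursor, the share of its spectra containing each fragment."""
    totals = defaultdict(int)                        # spectra per precursor
    counts = defaultdict(lambda: defaultdict(int))   # fragment observations per precursor
    for sequence, charge, observed in spectra:
        precursor = (sequence, charge)
        totals[precursor] += 1
        for fragment in set(observed):
            counts[precursor][fragment] += 1
    return {
        precursor: {f: c / totals[precursor] for f, c in counts[precursor].items()}
        for precursor in totals
    }


def global_fragment_probabilities(spectra):
    """Global baseline: one probability per fragment label, pooled over all peptides."""
    total = 0
    counts = defaultdict(int)
    for _, _, observed in spectra:
        total += 1
        for fragment in set(observed):
            counts[fragment] += 1
    return {f: c / total for f, c in counts.items()}


per_precursor = precursor_fragment_probabilities(spectra)
global_stats = global_fragment_probabilities(spectra)

print(per_precursor[("PEPTIDE", 2)])  # peptide-specific estimates
print(per_precursor[("ACDK", 2)])     # differ from the first precursor
print(global_stats)                   # a single pooled value per fragment
```

In this toy data the pooled estimate for `b2` lies between the two precursors' own frequencies, which is the kind of information a global statistic discards and a peptide-specific predictor would aim to recover.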
Primary Area: datasets and benchmarks
Submission Number: 23663