Towards Open-Search De Novo Peptide Sequencing via Mass-Based Zero-Shot Learning

12 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 · CC BY 4.0
Keywords: Adversarial Networks, Attention Models, Computational Biology and Bioinformatics, Multitask and Transfer Learning, One-Shot/Low-Shot Learning Approaches, Regression
TL;DR: We propose a mass-based reformulation of the de novo peptide sequencing problem that enables zero-shot prediction of unseen post-translational modifications, demonstrating a proof of concept for open-search DNPS using adversarial multi-task learning.
Abstract: Proteins are the main drivers of biochemical processes and play a pivotal role in almost all cellular functions. Through post-translational modifications (PTMs), residues within a protein can be chemically modified to fine-tune the protein's function in the cellular context. Deep learning-based de novo peptide sequencing (DNPS) models, in contrast to database search approaches, predict peptide sequences solely from tandem mass spectra, without a reference organism database. Despite the importance of PTMs, however, the many existing DNPS models can only predict peptide sequences containing a limited set of PTMs, because they rely on fixed vocabularies that map residue tokens to non-generalizable learned embeddings. To overcome this limitation, we propose a novel approach that exploits the fact that amino acids and their derivatives are characterized by their mass, a generalizable feature that enables zero-shot learning. Specifically, we reformulate DNPS as a mass prediction problem rather than a multiclass classification problem: the model predicts the mass of the next residue instead of its token representation. To facilitate generalization to unseen PTMs, we use an adversarial multi-task learning scheme in which the training data of experimental spectra is supplemented with simulated spectra that mimic spectra containing unseen residues. We show that our approach allows the prediction of previously unseen PTMs, providing a promising proof of concept for mass-based representations as a path towards true open-search DNPS.
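Illustrative sketch (not taken from the submission): one way to read the mass-based reformulation is that a predicted residue mass can be matched against a table of known residue masses, and a prediction that matches none of them within a tolerance points to a residue outside the fixed vocabulary. The function name `decode_mass`, the tolerance of 0.01 Da, and the small residue subset below are assumptions made for illustration; the authors' actual decoding procedure may differ.

```python
from typing import Optional

# Standard monoisotopic residue masses in Da (small illustrative subset).
RESIDUE_MASSES = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
    "T": 101.04768,
    "M": 131.04049,
    "F": 147.06841,
}


def decode_mass(predicted_mass: float, tol: float = 0.01) -> Optional[str]:
    """Map a predicted residue mass to the nearest known residue.

    Returns the residue token if the nearest known mass lies within
    `tol` Da (a hypothetical tolerance), otherwise None, meaning the
    mass does not correspond to any residue in the table.
    """
    token, mass = min(
        RESIDUE_MASSES.items(), key=lambda kv: abs(kv[1] - predicted_mass)
    )
    return token if abs(predicted_mass - mass) <= tol else None


# Oxidized methionine: methionine plus the oxidation mass shift (+15.99491 Da).
oxidized_met = RESIDUE_MASSES["M"] + 15.99491

print(decode_mass(57.0215))       # close to glycine
print(decode_mass(oxidized_met))  # no table entry within tolerance
```

In this sketch, a token-based vocabulary would have no entry for oxidized methionine unless it was added in advance, whereas a mass output can still express it as a number, which is the property the abstract relies on for zero-shot prediction.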
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 27987