Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

Ye Du; Chen Yang; Nanxi Yu; Wanyu Lin; Qian Zhao; Shujun Wang

Published: 01 May 2025, Last Modified: 18 Jun 2025. Venue: ICML 2025 poster. License: CC BY 4.0.

TL;DR: A new computational paradigm that addresses the missing-fragmentation problem in mass spectra for de novo peptide sequencing.

Abstract: *De novo* peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.
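The set-prediction step described in the abstract can be illustrated with a small, self-contained sketch. It is not taken from the LIPNovo code base: the function names, the squared-distance cost, and the toy two-dimensional "latents" are all assumptions made for illustration, and the brute-force search stands in for a proper Hungarian-algorithm solver (for example `scipy.optimize.linear_sum_assignment`) that a real implementation would more likely use. The idea shown is only the matching itself: each target latent (standing in for a theoretical peak) is assigned to a distinct query output so that the total cost is minimal.

```python
from itertools import permutations


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def optimal_matching(predictions, targets):
    """Brute-force minimum-cost one-to-one matching (toy sizes only).

    Hypothetical helper: assignment[j] is the index of the prediction
    matched to target j. Requires len(predictions) >= len(targets).
    """
    if len(predictions) < len(targets):
        raise ValueError("need at least as many predictions as targets")
    best_assignment, best_cost = (), float("inf")
    # Try every way of giving each target its own distinct prediction.
    for perm in permutations(range(len(predictions)), len(targets)):
        cost = sum(
            squared_distance(predictions[i], targets[j])
            for j, i in enumerate(perm)
        )
        if cost < best_cost:
            best_assignment, best_cost = perm, cost
    return list(best_assignment), best_cost


# Toy example: three query outputs, two target latents (assumed values).
query_outputs = [[0.9, 0.1], [0.0, 1.1], [0.5, 0.5]]
target_latents = [[0.0, 1.0], [1.0, 0.0]]

assignment, total_cost = optimal_matching(query_outputs, target_latents)
print(assignment, round(total_cost, 6))
```

Under these assumptions, the matched cost would serve as a training signal that pulls each matched query output toward its target latent, while unmatched queries (here the third one) are left out of the sum. The cost function and the way target latents are produced in LIPNovo itself are not specified in the abstract.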

Lay Summary: Peptides, the building blocks of proteins, are crucial for understanding biological processes and developing new therapies. De novo peptide sequencing is a computational technique that determines peptide sequences directly from mass spectrometry data, without relying on existing databases. However, missing data in spectra, caused by suboptimal experimental conditions, makes sequencing challenging. To address this, we developed LIPNovo, a novel method that compensates for missing information in spectral data before predicting peptide sequences. Instead of trying to recreate the missing raw data, LIPNovo uses machine learning to fill in the gaps within the model's internal (latent) representations, guided by theoretical knowledge of how peptides fragment. This approach improves the quality and reliability of peptide predictions. Our experiments show that LIPNovo significantly outperforms existing methods, making peptide sequencing more accurate. This advancement has the potential to accelerate discoveries in biology, biotechnology, and medicine.
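To make "missing data in spectra" concrete, the sketch below computes the theoretical singly charged b- and y-ion peaks of an unmodified peptide and reports which of them are absent from an observed peak list. It is an illustration only and not part of LIPNovo: the function names, the 0.02 Da tolerance, the example peptide, and the restriction to singly charged b/y ions without modifications are all assumptions. The residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in Da (standard values, rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276
WATER = 18.010565


def theoretical_peaks(peptide):
    """Return (b_ions, y_ions) m/z lists for charge 1+, no modifications."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:            # b_1 .. b_(n-1): N-terminal prefixes
        prefix += m
        b_ions.append(prefix + PROTON)
    suffix = 0.0
    for m in reversed(masses[1:]):   # y_1 .. y_(n-1): C-terminal suffixes
        suffix += m
        y_ions.append(suffix + WATER + PROTON)
    return b_ions, y_ions


def missing_peaks(theoretical, observed, tol=0.02):
    """Theoretical m/z values with no observed peak within `tol` Da."""
    return [
        t for t in theoretical
        if not any(abs(t - o) <= tol for o in observed)
    ]


b_ions, y_ions = theoretical_peaks("PEPTIDE")
# Simulate a spectrum in which only the first two b ions were detected.
observed = b_ions[:2] + y_ions
absent = missing_peaks(b_ions + y_ions, observed)
print(len(absent), "theoretical peaks are missing from the spectrum")
```

In this toy spectrum the four absent peaks are exactly the b ions that were dropped, which is the kind of gap that the summary describes LIPNovo as compensating for in latent space rather than by regenerating the raw peaks.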

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/usr922/LIPNovo

Primary Area: Applications->Chemistry, Physics, and Earth Sciences

Keywords: De Novo Peptide Sequencing; Mass Spectrum; Latent Space Imputation

Submission Number: 3227
