Protein Language Model–Aligned Spectra Embeddings for De Novo Peptide Sequencing

Navid NaderiAlizadeh; Christian Dallago; Erik J. Soderblom; Scott H Soderling

19 Sept 2025 (modified: 22 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: de novo peptide sequencing, tandem mass spectrometry, protein language models, constrained learning

Abstract: We consider the problem of *de novo* peptide sequencing in tandem mass spectrometry, where the goal is to predict the underlying peptide sequence given a spectrum's fragment peaks and precursor information. We present PLMNovo, a constrained learning framework that leverages pre-trained protein language models (PLMs) to guide the training process. In particular, we cast peptide-spectrum matching as a constrained optimization problem that enforces alignment between spectrum and peptide embeddings produced by a spectrum encoder and a PLM, respectively. We use a Lagrangian primal-dual algorithm to train the spectrum encoder and the peptide decoder by solving the proposed constrained learning problem, while optionally fine-tuning the pre-trained PLM. Through numerical experiments on established benchmarks, we demonstrate that PLMNovo outperforms several state-of-the-art deep learning-based *de novo* sequencing algorithms.
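The Lagrangian primal-dual training mentioned in the abstract can be illustrated with a minimal sketch on a toy problem. The abstract does not give the actual losses, constraint form, tolerance, or step sizes, so everything below is an assumption made for illustration: a scalar `theta` stands in for the spectrum encoder's output, `plm_embedding` for a fixed PLM peptide embedding, a quadratic `task_target` term for the sequencing loss, and `eps` for a hypothetical alignment tolerance. It shows only the general pattern of gradient descent on the model parameters combined with projected gradient ascent on a non-negative multiplier.

```python
def primal_dual(steps=5000, lr_primal=0.05, lr_dual=0.1, eps=0.25):
    """Toy primal-dual loop: minimize a task loss subject to an alignment constraint.

    All quantities are hypothetical stand-ins, not the paper's actual objective:
      task loss      f(theta) = (theta - task_target)^2
      alignment gap  g(theta) = (theta - plm_embedding)^2 - eps  (feasible when <= 0)
    """
    theta = 2.0          # primal variable (stand-in for the spectrum embedding)
    lam = 0.0            # dual variable (Lagrange multiplier), kept >= 0
    plm_embedding = 0.0  # fixed peptide embedding (stand-in for a frozen PLM)
    task_target = 2.0    # unconstrained minimizer of the toy task loss

    for _ in range(steps):
        # Constraint slack at the current iterate.
        align_gap = (theta - plm_embedding) ** 2 - eps

        # Gradient of the Lagrangian f(theta) + lam * g(theta) w.r.t. theta.
        grad_theta = 2.0 * (theta - task_target) + lam * 2.0 * (theta - plm_embedding)

        # Primal step: gradient descent on the Lagrangian.
        theta -= lr_primal * grad_theta

        # Dual step: gradient ascent on lam, projected onto lam >= 0.
        lam = max(0.0, lam + lr_dual * align_gap)

    return theta, lam


theta, lam = primal_dual()
print(f"theta={theta:.4f}, lambda={lam:.4f}")
```

In this toy setting the constrained optimum can be worked out by hand: the constraint is active at `theta = 0.5`, and stationarity of the Lagrangian gives a multiplier of 3. The multiplier grows while the alignment constraint is violated and stops growing once the constraint is met, which is what distinguishes this scheme from a fixed-weight penalty term.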

Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)

Submission Number: 20300
