Keywords: De Novo Peptide Sequencing, Causality, Proteomics
Abstract: \emph{De novo} peptide sequencing is a foundational computational technique in proteomics, critical for discovering and characterizing novel peptides and proteins within complex biological systems. To predict peptide sequences directly from tandem mass spectra, mainstream deep learning approaches model the relationship between mass spectra and their corresponding peptides. However, these models face significant challenges, particularly under noisy conditions: they often capture superficial correlations within noisy spectral data and fail to identify the underlying causal mechanisms that link true signal fragment ions to peptide sequences. Consequently, they tend to learn spurious associations that do not generalize in practice, where noise peaks change with different co-elutions or chemical contaminants. To address this, we introduce CausalNovo, a model-agnostic framework that learns causal representations of mass spectra in peptide sequencing models by focusing on signal fragment ions. Specifically, grounded in two practical and general principles, independence and sufficiency, CausalNovo employs causal interventions and information-theoretic objectives to disentangle causal representations from spurious noise peaks. Extensive experiments on three public datasets show that CausalNovo generalizes effectively across varying Noise Signal Ratios (NSR) and remains relatively stable under changes in non-causal peaks. As a result, CausalNovo yields consistent and significant gains of up to 10\% at the amino acid, peptide, and PTM levels. Code is available at https://anonymous.4open.science/r/CausalNovo-C134.
Primary Area: applications to physical sciences (physics, chemistry, biology, etc.)
Submission Number: 11920