# Code Supplement for "Towards Open-Search De Novo Peptide Sequencing via Mass-Based Zero-Shot Learning"

This repository contains the supplementary code for the paper "Towards Open-Search De Novo Peptide Sequencing via Mass-Based Zero-Shot Learning," submitted to NeurIPS 2025. The work addresses the limitations of current deep learning-based de novo peptide sequencing (DNPS) models in handling post-translational modifications (PTMs) due to their reliance on fixed vocabularies. We propose a novel approach that reformulates DNPS as a continuous mass prediction problem, leveraging mass as a generalizable feature for zero-shot learning of unseen PTMs. To facilitate generalization, an adversarial multi-task learning (MTL) scheme is employed, combining experimental and simulated spectra during training.

## Project Structure

The code is organized into two main branches, representing different model configurations discussed in the paper. Note that both branches build upon the publicly available [Casanovo code](https://github.com/Noble-Lab/casanovo) (Yilmaz et al., 2024):

**MTL_Model**: This directory contains the implementation of the Multi-Task Learning (MTL) model. This model is trained on a balanced mixture of simulated and experimental spectra to improve generalization to unseen PTMs.

**GAN_Model**: This directory contains the implementation of the MTL model extended with the adversarial approach. This model incorporates a generative adversarial network (GAN)-inspired framework to align representations of experimental and simulated spectra in a shared latent space, encouraging domain-invariant encodings. The code is built upon the MTL code and extends it with the necessary modules.
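Both models train on a mixture of experimental and simulated spectra. As a rough illustration of what a balanced mixture can look like, the sketch below builds batches that contain equal numbers of items from each domain. It is not the repository's data loader: the function name `balanced_mixture`, the 50/50 per-batch ratio, and the use of plain integer indices are all assumptions made for this example.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def balanced_mixture(
    experimental: Sequence[int],
    simulated: Sequence[int],
    batch_size: int,
    seed: int = 0,
) -> Iterator[List[Tuple[str, int]]]:
    """Yield batches holding equal numbers of experimental and simulated items.

    Hypothetical helper: the items are spectrum indices, tagged with their domain.
    """
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for a 50/50 mixture")
    rng = random.Random(seed)
    half = batch_size // 2
    exp = list(experimental)
    sim = list(simulated)
    rng.shuffle(exp)
    rng.shuffle(sim)
    # Stop once the smaller domain is exhausted so every batch stays balanced.
    n_batches = min(len(exp), len(sim)) // half
    for b in range(n_batches):
        batch = [("experimental", i) for i in exp[b * half:(b + 1) * half]]
        batch += [("simulated", i) for i in sim[b * half:(b + 1) * half]]
        rng.shuffle(batch)  # avoid a fixed domain order inside the batch
        yield batch


if __name__ == "__main__":
    for batch in balanced_mixture(range(10), range(100, 106), batch_size=4):
        print(batch)
```

The domain tag carried with each index is what a domain discriminator, such as the one in the GAN branch, would need as its label.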

The `casanovo` sub-directory within each model branch contains the following files. We briefly describe only the files most relevant to our model's functionality:

- `casanovo/`
  - `config.yaml`: The hyperparameter configurations used to train the models.
  - `data/`:
    - Data is handled through `denovo/dataloaders.py` and `denovo/datasets.py`
  - `denovo/`:
    - `dataloaders.py`: Data loading utilities for training and evaluation. Here, the simulated and experimental data are loaded jointly.
    - `datasets.py`: Implementations for the experimental and simulated datasets. To optimize disk usage, we merged the original Casanovo splits into a single Lance dataset and index the spectra based on a pre-defined split (i.e., the original Casanovo train/val/test split). However, the code also supports arbitrary splits without storing the data twice.
    - `model_runner.py`: Script to handle the model training setup.
    - `model.py`: Definition of the DNPS model. Here, for example, the model's loss function is defined.
    - `tokenizer.py`: We added a `MassAwarePeptideTokenizer` class to handle the tokenization of peptide sequences. This tokenizer is designed to work with the mass-based approach and ignores any tokens defined in the model configuration that share the same mass and therefore cannot be distinguished.
    - `transformers.py`: Implementation of the mass regression decoder. This is the core of the model and is partially based on the concepts presented in ContraNovo (Jin et al., 2024).
  - `gan/` (only in GAN Model):
    - `discriminator.py`: Implementation of the adversarial discriminator.
    - `model.py`: Combines the MTL model with the GAN discriminator. This also includes the loss function for the GAN model.
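To illustrate why tokens with identical masses cannot be told apart by a mass-based model, the following sketch groups residues by rounded monoisotopic mass and keeps one representative per group. It is not the `MassAwarePeptideTokenizer` implementation: the helper names, the rounding precision, and the choice to keep the first token of each group are assumptions for this example. The residue masses are the standard monoisotopic values.

```python
from typing import Dict, List

# Standard monoisotopic residue masses in Da (small subset for illustration).
RESIDUE_MASSES: Dict[str, float] = {
    "G": 57.021464,
    "A": 71.037114,
    "L": 113.084064,
    "I": 113.084064,  # isobaric with leucine
    "K": 128.094963,
    "Q": 128.058578,  # close to lysine, but not identical
}


def group_by_mass(masses: Dict[str, float], decimals: int = 4) -> Dict[float, List[str]]:
    """Group tokens whose masses agree after rounding to `decimals` places."""
    groups: Dict[float, List[str]] = {}
    for token, mass in masses.items():
        groups.setdefault(round(mass, decimals), []).append(token)
    return groups


def distinguishable_tokens(masses: Dict[str, float], decimals: int = 4) -> List[str]:
    """Keep the first token of every mass group (assumed tie-breaking rule)."""
    return [tokens[0] for tokens in group_by_mass(masses, decimals).values()]


if __name__ == "__main__":
    print(distinguishable_tokens(RESIDUE_MASSES))
```

In this subset, leucine and isoleucine collapse into a single entry, while lysine and glutamine remain separate because their masses differ by roughly 0.036 Da.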

## Setup

To run the code, we recommend setting up the provided conda environment. We include the `environment.yaml` that we used to run all experiments. This file contains all dependencies needed to run the code and the evaluations, along with some additional packages. You can create the conda environment using the following commands:

```Bash
conda env create -f environment.yaml
conda activate mass_based_dnps
```

## Usage

To allow for reproducibility, we provide the code used for training and running the MTL and GAN models.

### Training

To train a model (e.g., the GAN-inspired MTL model or the basic MTL model), we ran the following commands on our Slurm cluster. Adjust the arguments to match the correct file paths (i.e., `config.yaml` for either model). Note again that we merged the original Casanovo splits into a single Lance dataset and index the spectra based on a pre-defined split. Merging into one Lance dataset can be achieved by providing all splits as input to the model. The indices can then be generated, for example by filtering on the source file, which identifies the original train/val/test split:

```Bash
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a40:1
#SBATCH --mem=60GB
#SBATCH --job-name=train_mass_based_dnps
#SBATCH --output=Training/logs/%x.txt

config_path="path/to/config.yaml"

lance_path="path/to/mskb_casanovo_data/combined/spectra_82c0124b_combined.lance"

output_path="Training/logs/train_mass_based_dnps"

casanovo train -c "$config_path" -o "$output_path" -p "$lance_path" "$lance_path"
```
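The index generation mentioned above can be sketched as follows. This is a minimal example, assuming that the source file of every row in the merged dataset is available as a plain list and that each source file belongs to exactly one split. The function name `split_indices` and the file names are hypothetical and do not correspond to the repository's code.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_indices(
    source_files: Iterable[str],
    file_to_split: Dict[str, str],
) -> Dict[str, List[int]]:
    """Map each split name to the row indices whose source file belongs to it."""
    indices: Dict[str, List[int]] = defaultdict(list)
    for row, source in enumerate(source_files):
        # A KeyError here means a source file has not been assigned to a split.
        indices[file_to_split[source]].append(row)
    return dict(indices)


if __name__ == "__main__":
    # Hypothetical file names; one entry per row of the merged dataset.
    rows = ["train.mgf", "train.mgf", "val.mgf", "test.mgf", "train.mgf"]
    mapping = {"train.mgf": "train", "val.mgf": "val", "test.mgf": "test"}
    print(split_indices(rows, mapping))
```

Because only the index lists differ between splits, a different partition can be defined by changing the mapping while the spectra themselves are stored once.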

### Evaluation

To evaluate our trained models, we ran them in teacher forcing mode: the model is provided with the ground truth sequence (shifted by one position) during inference, which is the same setting as used during training. Teacher forcing ensures that earlier mistakes do not propagate through the model and that the predicted masses align position by position with the ground truth sequence.
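The shift by one position can be illustrated with a short sketch. It assumes that the decoder consumes a sequence of residue masses and that a placeholder value marks the start of the sequence. The `START` value and both function names are hypothetical and are not taken from `run_model_teacher_forcing.py`.

```python
from typing import List, Tuple

START = 0.0  # hypothetical start-of-sequence placeholder


def teacher_forcing_pairs(true_masses: List[float]) -> Tuple[List[float], List[float]]:
    """Build decoder inputs (ground truth shifted right by one) and targets."""
    decoder_inputs = [START] + true_masses[:-1]
    targets = list(true_masses)
    return decoder_inputs, targets


def per_position_errors(predicted: List[float], targets: List[float]) -> List[float]:
    """Absolute mass error at every position; valid because positions align."""
    if len(predicted) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    return [abs(p - t) for p, t in zip(predicted, targets)]


if __name__ == "__main__":
    # Residue masses of G, A, L (standard monoisotopic values in Da).
    inputs, targets = teacher_forcing_pairs([57.021464, 71.037114, 113.084064])
    print(inputs)
    print(targets)
```

At position `i` the decoder sees only the true masses before `i`, so a wrong prediction at one position does not change the inputs at later positions.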

To obtain the mass predictions for the MSV-V1 test set, we used the following code:

```Bash
python run_model_teacher_forcing.py
```

Based on these predictions, we then generated all plots and numerical results in the paper.

## References

- Melih Yilmaz, William E Fondrie, Wout Bittremieux, Carlo F Melendez, Rowan Nelson, Varun
  Ananth, Sewoong Oh, and William Stafford Noble. Sequence-to-sequence translation from
  mass spectra to peptides with a transformer model. Nature Communications, 15(1):6427, 2024.
- Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao,
  Cheng Chang, and Siqi Sun. ContraNovo: A contrastive learning approach to enhance de
  novo peptide sequencing. In Proceedings of the AAAI Conference on Artificial Intelligence,
  volume 38, pages 144–152, 2024.
