# Code for the submitted paper "Protein Language Model Embeddings Improve Generalization of Implicit Transfer Operators"


# Environment Setup
```bash
conda create -n platito python=3.11
conda activate platito
pip install -r requirements.txt
```

# Dataset Setup

First, you need to prepare a dataset directory.
Create a folder (e.g. `dataset_dir`) with the following structure:

1. `dataset_dir/mdCATH/data` contains the mdCATH MD trajectories downloaded from Hugging Face.
2. `dataset_dir/mdCATH/esmc_6b_200res.pt` contains the ESMC embeddings precomputed for every sequence in the dataset.
3. `dataset_dir/mdCATH/metadata.json` contains additional metadata for each protein, such as:
    - the amino-acid sequence
    - classification and confidence values from DeepSeek
4. To train the PLaTITO+Struct and PLaTITO+Struct+LLM models, `dataset_dir` must also contain a `proteina` folder with:
    - `proteina/proteina_v1.3_DFS_60M_notri.ckpt`: checkpoint of the pretrained 60M model, downloaded from the Proteina repo.
    - `proteina/pdb_raw/cath_label_mapping.pt`: mapping from CATH code to an integer index, downloaded from the Proteina repo.
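
Before starting a long training run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the constants `REQUIRED` and `STRUCT_ONLY` are hypothetical, and the script only checks that the listed paths exist, not that their contents are valid.

```python
from pathlib import Path

# Paths relative to the dataset directory, copied from the list above.
REQUIRED = [
    "mdCATH/data",
    "mdCATH/esmc_6b_200res.pt",
    "mdCATH/metadata.json",
]

# Only needed for the PLaTITO+Struct and PLaTITO+Struct+LLM models.
STRUCT_ONLY = [
    "proteina/proteina_v1.3_DFS_60M_notri.ckpt",
    "proteina/pdb_raw/cath_label_mapping.pt",
]


def missing_entries(dataset_dir, with_struct=False):
    """Return the expected relative paths that do not exist under dataset_dir."""
    root = Path(dataset_dir)
    expected = REQUIRED + (STRUCT_ONLY if with_struct else [])
    return [rel for rel in expected if not (root / rel).exists()]
```

Calling `missing_entries("dataset_dir", with_struct=True)` returns an empty list when every expected file and folder is present, and otherwise lists what still needs to be downloaded or precomputed.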

# Training

Training is tracked with wandb. By overriding values in the main config `train.yaml`, we can train the different TITO variants:

- TITO

```bash
python scripts/train.py -cn train paths.project_data_dir=dataset_dir model/nn@model.structure_net=cond_net_tito logger.project=platito logger.name=tito
```

- PLaTITO

```bash
python scripts/train.py -cn train paths.project_data_dir=dataset_dir model/nn@model.structure_net=cond_net_platito logger.project=platito logger.name=platito
```

- PLaTITO+Struct

```bash
python scripts/train.py -cn train paths.project_data_dir=dataset_dir model/nn@model.structure_net=cond_net_platitoStruct logger.project=platito logger.name=platito_struct
```

- PLaTITO+Struct+LLM

```bash
python scripts/train.py -cn train paths.project_data_dir=dataset_dir model/nn@model.structure_net=cond_net_platitoStructLLM logger.project=platito logger.name=platito_struct_llm
```
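
The four commands above differ only in the `model/nn@model.structure_net` override and the run name. As an illustration, the sketch below rebuilds them from a small mapping. The names `VARIANTS` and `train_command` are hypothetical helpers rather than part of the repository; the override keys and values are copied from the commands above.

```python
import shlex

# Run name -> structure_net config, as used in the commands above.
VARIANTS = {
    "tito": "cond_net_tito",
    "platito": "cond_net_platito",
    "platito_struct": "cond_net_platitoStruct",
    "platito_struct_llm": "cond_net_platitoStructLLM",
}


def train_command(run_name, dataset_dir="dataset_dir", project="platito"):
    """Build the argument list for training one variant."""
    return [
        "python", "scripts/train.py", "-cn", "train",
        f"paths.project_data_dir={dataset_dir}",
        f"model/nn@model.structure_net={VARIANTS[run_name]}",
        f"logger.project={project}",
        f"logger.name={run_name}",
    ]


# Print one shell-ready command per variant.
for name in VARIANTS:
    print(shlex.join(train_command(name)))
```

Each argument list can also be passed directly to `subprocess.run` to launch the runs one after another from a single script.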

# Inference

To generate trajectories, we can use the `generate_fast_folders.py` script and, as for training, override values in the corresponding `generate_fast_folders.yaml` config. For inference, `dataset_dir` must also contain a `fast_folders` directory with metadata for the dataset, and the trained model must be available in wandb so that its weights can be loaded from there:

```bash
python scripts/generate_fast_folders.py -cn generate_fast_folders paths.project_data_dir=dataset_dir protein_name=A3D start_frames=unfolded samples_per_iteration=1000 step=1 number_of_steps=1000 wandb.run_id=XXXXX wandb.entity=ABCD
```


