# Endowing Protein Language Models with Structural Knowledge

This repository contains the code for the Protein Structure Transformer (PST).


## Requirements

Running our experiments requires Python 3.10 and the following packages:

```
biopandas==0.4.1
black==23.7.0
deepdiff==6.3.1
easydict==1.10
einops==0.6.0
fair-esm==2.0.0
fastapi==0.101.0
fastavro==1.8.2
hydra-core==1.3.2
isort==5.12.0
joblib==1.2.0
lightgbm==4.1.0
lightning==2.0.6
matplotlib==3.7.2
networkx==3.0
numpy==1.24.1
omegaconf==2.3.0
openpyxl==3.1.2
pandas==2.0.1
proteinshake==0.3.13
pyg-lib==0.2.0+pt20cu118
pyprojroot==0.3.0
rdkit-pypi==2022.9.5
rich==13.5.2
scikit-learn==1.2.2
scikit-multilearn==0.2.0
scipy==1.10.1
tensorboard==2.12.2
torch==2.0.0+cu118
torch-cluster==1.6.1+pt20cu118
torch-geometric==2.3.0
torch-scatter==2.1.1+pt20cu118
torch-sparse==0.6.17+pt20cu118
torch-spline-conv==1.2.2+pt20cu118
torchaudio==2.0.1+cu118
torchdrug==0.2.1
torchmetrics==1.0.3
torchvision==0.15.1+cu118
tqdm==4.65.0
```
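To confirm that an environment matches the pins above, a check along the following lines may help. This is a minimal sketch using only the standard library; the helper names `parse_pins` and `check_pins` are illustrative and not part of this repository, and the sample pins in the demo are copied from the list above.

```python
import sys
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines into a dict, skipping blanks and comments."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins, get_version=metadata.version):
    """Return {name: (pinned, installed_or_None)} for every package that differs."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Hypothetical demo: a few pins copied from the list above.
    sample = "numpy==1.24.1\ntorch==2.0.0+cu118\nfair-esm==2.0.0\n"
    if sys.version_info[:2] != (3, 10):
        print(f"Warning: Python 3.10 expected, found {sys.version.split()[0]}")
    for name, (pinned, installed) in check_pins(parse_pins(sample)).items():
        print(f"{name}: pinned {pinned}, installed {installed}")
```

Passing the full pinned list as text to `parse_pins` checks every package at once; an empty result from `check_pins` means all installed versions match.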

## Instructions for running the PST model

Run `source s` first.

#### Pretraining

Run `train_pst.py` to pretrain the PST model on the AlphaFold Swiss-Prot database of about 542K structures. Once training finishes, place the resulting models in `pretrained/${model}`, where `${model}` is the model name.
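The placement step can be scripted. The sketch below is one possible way to do it, not the repository's own tooling: the function `place_checkpoint`, the checkpoint file name, and the model name used in the demo are all hypothetical, and the file keeps whatever name it had when training wrote it.

```python
import shutil
import tempfile
from pathlib import Path


def place_checkpoint(src, model_name, root="pretrained"):
    """Copy a trained checkpoint into <root>/<model_name>/, keeping its file name."""
    src = Path(src)
    if not src.is_file():
        raise FileNotFoundError(f"checkpoint not found: {src}")
    target_dir = Path(root) / model_name
    target_dir.mkdir(parents=True, exist_ok=True)  # create pretrained/<model> if missing
    return Path(shutil.copy2(src, target_dir / src.name))


if __name__ == "__main__":
    # Demo in a temporary directory with a placeholder file (hypothetical names).
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp) / "example.ckpt"
        ckpt.write_bytes(b"placeholder")
        dest = place_checkpoint(ckpt, "example_model", root=Path(tmp) / "pretrained")
        print(dest.relative_to(tmp))
```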

#### Finetuning

Run `predict_*.py` to train a classification head on fixed representations from the PST model for each benchmark dataset.

Run `finetune_*.py` to fine-tune the full model on the EC, GO, and Fold classification datasets.
