# KeAP
This repository contains the implementation of Protein Representation Learning via Knowledge Enhanced Primary Structure Reasoning. The proposed Knowledge-exploited Auto-encoder for Proteins (KeAP) performs implicit knowledge encoding by learning to exploit knowledge for protein primary structure reasoning.

## Requirements
Pre-training requires the following environment:

python 3.8 / pytorch 1.10.0 / transformers 4.5.1+ / deepspeed 0.6.5 / lmdb
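
The versions listed above can be checked before launching a long run. The following is a minimal sketch, not part of the repository: it assumes the packages were installed under the PyPI names `torch`, `transformers`, `deepspeed` and `lmdb`, and it treats the `+` after 4.5.1 as "this version or newer". The helper names are hypothetical.

```python
"""Sketch: compare installed package versions with the ones listed above."""
import sys
from importlib import metadata

# Versions come from the requirements line; the distribution names
# ("torch", "transformers", ...) are assumptions about how they were installed.
REQUIRED = {
    "torch": ("1.10.0", "exact"),
    "transformers": ("4.5.1", "minimum"),
    "deepspeed": ("0.6.5", "exact"),
    "lmdb": (None, "any"),
}


def parse_version(text):
    """Turn '1.10.0+cu113' into (1, 10, 0); local and pre-release suffixes are ignored."""
    core = text.split("+")[0]
    parts = []
    for piece in core.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, required, mode):
    """Return True if `installed` meets `required` under `mode`."""
    if mode == "any":
        return True
    have, want = parse_version(installed), parse_version(required)
    if mode == "minimum":
        return have >= want
    return have == want


def check_environment():
    """Print one status line per requirement and return True if all of them match."""
    ok = sys.version_info[:2] == (3, 8)
    status = "ok" if ok else "differs from 3.8"
    print(f"python {sys.version_info.major}.{sys.version_info.minor}: {status}")
    for name, (required, mode) in REQUIRED.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed")
            ok = False
            continue
        good = satisfies(installed, required, mode)
        status = "ok" if good else f"expected {mode} {required}"
        print(f"{name} {installed}: {status}")
        ok = ok and good
    return ok


if __name__ == "__main__":
    check_environment()
```

A mismatch is only reported, not treated as fatal, since other version combinations may also work.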

You also need to replace the deepspeed.py file in the installed transformers library with the deepspeed.py file in replace_code/.
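
The replacement can be done by hand or scripted. The sketch below is one possible way to do it and is not part of the repository. It assumes that deepspeed.py sits at the top level of the installed transformers package and that the code is run from the repository root; the function names are hypothetical. The original file is kept as deepspeed.py.bak so the change can be undone.

```python
"""Sketch: overwrite transformers' deepspeed.py with the copy from replace_code/."""
import importlib.util
import shutil
from pathlib import Path


def find_package_dir(package="transformers"):
    """Return the directory of an installed package without importing it."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        raise RuntimeError(f"{package} is not installed in this environment")
    return Path(spec.origin).parent


def replace_module_file(package_dir, replacement, filename="deepspeed.py"):
    """Copy `replacement` over `package_dir/filename`, keeping a .bak backup."""
    package_dir, replacement = Path(package_dir), Path(replacement)
    if not replacement.is_file():
        raise FileNotFoundError(replacement)
    target = package_dir / filename
    backup = target.with_name(target.name + ".bak")
    # Back up the original only once so that re-running does not clobber it.
    if target.exists() and not backup.exists():
        shutil.copy2(target, backup)
    shutil.copy2(replacement, target)
    return target


def main():
    # Assumption: the current working directory is the repository root.
    patched = replace_module_file(find_package_dir(), "replace_code/deepspeed.py")
    print(f"replaced {patched}")
```

Calling `main()` from the repository root patches whichever transformers installation the active Python environment resolves, so it should be run inside the environment used for pre-training.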

## Dataset
Please download the pre-training dataset [here](https://drive.google.com/file/d/1iTC2-zbvYZCDhWM_wxRufCvV6vvPk8HR/view). The dataset consists of (Protein, Relation, Attribute) knowledge triplets.

## Pre-training
To pre-train KeAP, first download [ProtBERT](https://huggingface.co/Rostlab/prot_bert) and [PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext). Then set the dataset and model paths in script/run_pretrain.sh and start training with:

> sh script/run_pretrain.sh
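
The two checkpoints can also be fetched programmatically. The following is a hedged sketch rather than the authors' procedure: it uses `snapshot_download` from the `huggingface_hub` package, which is not among the listed requirements, and assumes a version recent enough to accept the `local_dir` argument. The Hub ids are taken from the links above; the local folder names and the `pretrained` root are hypothetical.

```python
"""Sketch: save local copies of the two checkpoints that pre-training starts from."""
from pathlib import Path

# Hub ids come from the links above; the local folder names are hypothetical.
CHECKPOINTS = {
    "prot_bert": "Rostlab/prot_bert",
    "pubmed_bert": "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
}


def local_path(root, name):
    """Folder in which the checkpoint called `name` is stored under `root`."""
    return Path(root) / name


def download(repo_id, out_dir):
    """Copy every file of a Hub repository into `out_dir` and return that path."""
    # Imported lazily so this module can be read without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=repo_id, local_dir=str(out_dir))
    return out_dir


def main(root="pretrained"):
    for name, repo_id in CHECKPOINTS.items():
        saved = download(repo_id, local_path(root, name))
        print(f"saved {repo_id} to {saved}")
```

After `main()` finishes, the resulting folders are the model paths to enter in script/run_pretrain.sh.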
