# README

## Environment
```bash
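conda env create -f environment.yml
# Activate the new environment (its name is set in environment.yml) before the pip steps
pip install -e .
pip install wandb
wandb offline  # log locally so no wandb login is required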
```



## Checkpoints

Checkpoints from our protein structure-aware pre-training (SAP) can be found at https://drive.google.com/drive/folders/1C242kxzFHAU9AdEWdAmmQqNc3wPnnDO3?usp=sharing.



## Datasets

Our HotProtein dataset can be found at https://drive.google.com/drive/folders/1C242kxzFHAU9AdEWdAmmQqNc3wPnnDO3?usp=sharing.

In file names and command arguments, `d1` refers to HP-S2C2 and `d2` refers to HP-S2C5.
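
Before launching training, it can help to confirm that the downloaded data is laid out the way the commands in the Commands section expect. The sketch below is not part of the repository: `expected_files` and `check_dataset` are hypothetical helpers, and the path list is inferred only from the arguments used in those commands (`d1/d1_fasta_clean`, `d1/d1_${i}.pkl`, `d1/d1_${i}_classification.pkl`, and the `d2` equivalents), assuming the dataset was extracted into the directory you run from.

```bash
# Hypothetical helper: list the paths the README commands reference for one dataset (d1 or d2).
expected_files() {
    local ds="$1"
    echo "${ds}/${ds}_fasta_clean"
    for i in $(seq 0 9); do
        echo "${ds}/${ds}_${i}_classification.pkl"
        echo "${ds}/${ds}_${i}.pkl"
    done
}

# Hypothetical helper: print every expected path missing under ROOT (default: current directory).
# Returns 0 when everything is present and 1 otherwise.
check_dataset() {
    local ds="$1" root="${2:-.}" missing=0
    for path in $(expected_files "$ds"); do
        if [ ! -e "${root}/${path}" ]; then
            echo "missing: ${path}"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_dataset d1 && check_dataset d2` prints nothing and succeeds when both datasets are in place.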



## Commands 

Commands to train the best-performing models (SAP + FST + Aug) on HP-S2C2 and HP-S2C5, with one run per split file (indices 0 to 9):

```bash
# Classification on HP-S2C2
for i in $(seq 0 9); do
CUDA_VISIBLE_DEVICES=0 nohup python -u finetune_sup_head_fst_sap.py esm1b_t33_650M_UR50S d1/d1_fasta_clean sup --num_classes 2 --include mean per_tok --toks_per_batch 2048 --idx d1 --lr 1e-3 --rank 4 --lr-factor 10 --split_file d1/d1_${i}_classification.pkl --seed 1 --wandb-name 0510_d1_${i}_adv_1e-6_seed1_GPU5_sap --adv --gamma 1e-6  > 0510_d1_${i}_adv_1e-6_seed1_GPU6_sap.out
done

# Regression on HP-S2C2
for i in $(seq 0 9); do
CUDA_VISIBLE_DEVICES=3 nohup python -u finetune_sup_head_regression_fst_sap.py esm1b_t33_650M_UR50S d1/d1_fasta_clean sup --include mean per_tok --toks_per_batch 2048 --num_classes 2 --idx d1 --lr 1e-3 --rank 4 --lr-factor 10 --split_file d1/d1_${i}.pkl --seed 1 --wandb-name 0510_d1_r_${i}_adv_1e-6_seed1_GPU5_sap --adv --gamma 1e-6  > 0510_d1_r_${i}_adv_1e-6_seed1_GPU6_sap.out
done

# Classification on HP-S2C5
for i in $(seq 0 9); do
CUDA_VISIBLE_DEVICES=1 nohup python -u finetune_sup_head_fst_sap.py esm1b_t33_650M_UR50S d2/d2_fasta_clean sup --include mean per_tok --toks_per_batch 2048 --num_classes 5 --idx d2 --lr 1e-3 --rank 4 --lr-factor 10 --split_file d2/d2_${i}_classification.pkl --seed 1 --wandb-name 0510_d2_${i}_adv_1e-6_seed1_GPU5_sap --adv --gamma 1e-6  > 0510_d2_${i}_adv_1e-6_seed1_GPU6_sap.out
done

# Regression on HP-S2C5
for i in $(seq 0 9); do
CUDA_VISIBLE_DEVICES=1 nohup python -u finetune_sup_head_regression_fst_sap.py esm1b_t33_650M_UR50S d2/d2_fasta_clean sup --include mean per_tok --toks_per_batch 2048 --num_classes 5 --idx d2 --lr 1e-3 --rank 4 --lr-factor 10 --split_file d2/d2_${i}.pkl --seed 1 --wandb-name 0510_d2_r_${i}_adv_1e-6_seed1_GPU5_sap --adv --gamma 1e-6  > 0510_d2_r_${i}_adv_1e-6_seed1_GPU6_sap.out
done
```
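
Each loop above writes one `.out` log per split. A quick way to see where the ten runs stand is to print the last line of each log. This is a minimal sketch, not part of the repository: `last_lines` is a hypothetical helper, it assumes the logs keep the file names used above, and it makes no assumption about what the training scripts print.

```bash
# Hypothetical helper: show the last line of each run log for one prefix.
# PREFIX is the part of the log name before the split index, e.g. 0510_d1 or 0510_d2_r.
# DIR is where the .out files were written (default: current directory).
last_lines() {
    local prefix="$1" dir="${2:-.}" log
    for i in $(seq 0 9); do
        log="${dir}/${prefix}_${i}_adv_1e-6_seed1_GPU6_sap.out"
        if [ -f "$log" ]; then
            printf 'split %s: %s\n' "$i" "$(tail -n 1 "$log")"
        else
            printf 'split %s: no log yet\n' "$i"
        fi
    done
}
```

For example, `last_lines 0510_d1` covers classification on HP-S2C2 and `last_lines 0510_d1_r` covers regression on the same dataset.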



## Acknowledgement

Our code is built on [esm](https://github.com/facebookresearch/esm).