## Installation

For training, sampling, and preparing simulations, use the conda environment defined in `mars.yaml`.

For preprocessing, use the conda environment defined in `mars_eval.yaml`.
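
As a minimal sketch, both environments can be created with `conda env create -f`. The `DRY_RUN` switch and the `create_env` helper are hypothetical additions for illustration and are not part of this repository; by default the commands are only printed.

```shell
# Sketch: create both conda environments from the YAML files named above.
# DRY_RUN is a hypothetical switch: with the default DRY_RUN=1 the commands
# are only printed; set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

create_env() {
  # $1: path to a conda environment file
  if [ "$DRY_RUN" = "1" ]; then
    echo "conda env create -f $1"
  else
    conda env create -f "$1"
  fi
}

create_env mars.yaml       # training, sampling, preparing simulations
create_env mars_eval.yaml  # preprocessing
```

The environment names are normally taken from the `name:` field inside each YAML file; activate the one you need with `conda activate <name>`.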

## Download Datasets

1. Download the tetrapeptide MD datasets from [https://github.com/bjing2016/mdgen](https://github.com/bjing2016/mdgen)

2. Download the MD-Cath dataset simulations from [https://huggingface.co/datasets/compsciencelab/mdCATH](https://huggingface.co/datasets/compsciencelab/mdCATH)
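
After downloading, it may help to confirm that the data sits where the later commands expect it. The sketch below assumes the directory names `data/4AA_sims` and `data/md_cath`, taken from the `--sim_dir` and `--data_dir` arguments used elsewhere in this README; `missing_dirs` is a hypothetical helper, not a script from this repository.

```shell
# Sketch: report which of the expected dataset directories are missing.
# Adjust the paths if you store the datasets elsewhere.
missing_dirs() {
  # Prints every argument that is not an existing directory, one per line.
  for d in "$@"; do
    [ -d "$d" ] || echo "$d"
  done
}

missing=$(missing_dirs data/4AA_sims data/md_cath)
if [ -n "$missing" ]; then
  echo "Missing dataset directories:"
  echo "$missing"
else
  echo "All dataset directories found."
fi
```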

## Prepare Dataset Trajectories

Prepare the tetrapeptide simulations:
```
python -m scripts.prepare.prep_sims --splits splits/4AA.csv --sim_dir data/4AA_sims --outdir data/4AA_sims --num_workers [N] --suffix _i100 --stride 100
```

Prepare the MD-Cath simulations:
```
python -m scripts.prepare.prep_sims_mdcath --splits splits/mdCATH.txt --sim_dir data/md_cath --outdir data/md_cath_processed --num_workers [N]
```
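
The `[N]` placeholder in the commands above is the number of worker processes. One possible way to choose it, sketched here under the assumption that using every online CPU is acceptable on your machine, is to query the CPU count:

```shell
# Sketch: choose a value for the [N] placeholder from the number of online CPUs.
# getconf _NPROCESSORS_ONLN is available on Linux and macOS; fall back to 1
# if the call fails or returns something that is not a number.
N=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
case "$N" in
  ''|*[!0-9]*) N=1 ;;
esac
echo "Using $N workers"

# The value is then passed in place of [N], for example:
# python -m scripts.prepare.prep_sims_mdcath --splits splits/mdCATH.txt \
#   --sim_dir data/md_cath --outdir data/md_cath_processed --num_workers "$N"
```

On shared machines, a smaller value than the full CPU count may be more appropriate.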

## Preprocess MSM clusters

For tetrapeptides:
```
python -m scripts.preprocess.msm_preprocess --train_split splits/4AA_train.csv --val_split splits/4AA_val.csv --data_dir data/4AA_sims --suffix _i100 --num_workers [N] --num_pcca_states 10
```

For MD-Cath:
```
python -m scripts.preprocess.msm_preprocess_mdcath_gr --fraction 1/1 --temp 450 --features gr,secondary --n_clusters 10 --data_dir data/md_cath --cluster_data_dir data/md_cath_processed --error_domains_file ./splits/erroneous_domains.txt
```

## Training

For tetrapeptides:
```
python train_msm.py --sim_condition --train_split splits/4AA_train.csv --val_split splits/4AA_val.csv --data_dir ./data/4AA_sims --prepend_ipa --abs_pos_emb --crop 4 --ckpt_freq 100 --val_repeat 25 --suffix _i100 --epochs 2001  --wandb --run_name 4AA_msm_save/c10_lag200 --num_workers 4 --cluster_sampling_mode original --x0_sampling_mode cluster_based_v2 --num_pcca_states 10 --msm_lagtime 200
```

For MD-Cath:
```
python train_msm.py --sim_condition --train_split splits/mdCATH_train.csv --val_split splits/mdCATH_val.csv --data_dir ./data/md_cath_processed --batch_size 8 --prepend_ipa --crop 256 --val_repeat 5 --epochs 1000 --mdcath --ckpt_freq 25 --wandb --run_name [run_name] --msm_num_clusters 10 --num_clusters 2 --num_samples_per_cluster 12 --x0_sampling_mode cluster_based_v2 --msm_lagtime 50 --no-msm_vampnet --msm_observables gr,secondary --msm_merge_replicas --no-msm_include_single_state --cluster_sampling_mode original --data_temperature 450
```



## Sampling

For hierarchical sampling with MarS only:
```
python sim_inference_2models_1toMany.py \
  --sim_ckpt_mdgen ${mdgen_ckpt} \
  --sim_ckpt_msm ${checkpoint} \
  --data_dir data/md_cath \
  --num_rollouts 1 \
  --calls_mdgen 0 \
  --calls_msm 200 \
  --initial_calls_mdgen 0 \
  --num_tree_rollouts 2 \
  --max_msm_samples 500 \
  --tree_parallel_chunk 100 \
  --split 'splits/mdCATH_test.csv' \
  --out_dir ${out_dir} \
  --mdcath \
  --suffix '' \
  --do_not_overwrite \
  --temp 450 \
  --seed 42
```

For MarS + MDGen sampling:
```
python sim_inference_2models_1toMany.py \
  --sim_ckpt_mdgen ${mdgen_ckpt} \
  --sim_ckpt_msm ${checkpoint} \
  --data_dir data/md_cath \
  --num_rollouts 1 \
  --calls_mdgen 1 \
  --calls_msm 50 \
  --initial_calls_mdgen 1 \
  --split 'splits/mdCATH_test.csv' \
  --out_dir ${out_dir} \
  --mdcath \
  --suffix '' \
  --do_not_overwrite \
  --temp 450
```
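
Both sampling commands expect the shell variables `mdgen_ckpt`, `checkpoint`, and `out_dir` to be set beforehand. A minimal sketch of doing so follows; the paths are hypothetical placeholders, and `require_vars` is an illustrative helper that is not part of this repository.

```shell
# Hypothetical paths, for illustration only; replace with your own files.
mdgen_ckpt="checkpoints/mdgen.ckpt"   # passed to --sim_ckpt_mdgen
checkpoint="checkpoints/mars.ckpt"    # passed to --sim_ckpt_msm
out_dir="outputs/mdcath_samples"      # passed to --out_dir

require_vars() {
  # Returns non-zero and names the first variable that is unset or empty.
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "variable '$name' is not set" >&2
      return 1
    fi
  done
}

require_vars mdgen_ckpt checkpoint out_dir && echo "ready to sample"
```

Checking the variables first avoids launching a run in which an empty expansion silently drops an argument value.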

## Analysis

The analysis scripts produce a pickle file in each sample directory. In the commands below, `$dir` is the directory that holds your sample directories.

For Tetrapeptides:

```
python -m scripts.analyze_peptide_sim --mddir ./data/4AA_sims --plot --save --num_workers 50 --pdbdir $dir
python -m scripts.read_pkl --dir $dir
```


For MD-Cath:

```
python -m scripts.analysis.alphaFlow_analysis --pdbdir "$dir" --num_workers [N] --xtc --truncate 500 --temp 450 --no_distributional --no_observable
python -m scripts.analysis.analyze_peptide_sim_mdcath --pdbdir "$dir" --msm_lag 50 --num_workers [N] --truncate 500 --temp 450 --notica
python -m scripts.analysis.complete_read_pkl "$dir"
```
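
To check that the analysis has finished, one option is to look for the pickle output. The sketch below assumes the output files use a `.pkl` extension (the exact file names are not stated here); `dirs_without_pkl` is a hypothetical helper, not a script from this repository.

```shell
# Sketch: print each given directory that does not yet contain a .pkl file.
# Assumption: the analysis output uses the .pkl extension.
dirs_without_pkl() {
  for d in "$@"; do
    found=0
    for f in "$d"/*.pkl; do
      # An unmatched glob stays literal, so test that the file really exists.
      [ -e "$f" ] && found=1 && break
    done
    [ "$found" = 1 ] || echo "$d"
  done
}

# Check $dir itself (falls back to the current directory if $dir is unset).
dirs_without_pkl "${dir:-.}"
```

To check every sample subdirectory instead, pass them all at once, for example `dirs_without_pkl "$dir"/*/`.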

## Baselines

For MDGen training and sampling, please refer to [https://github.com/bjing2016/mdgen](https://github.com/bjing2016/mdgen)
