# TopoDiff

## Introduction

This repository contains a minimal runnable implementation of the paper "TopoDiff: Improve Protein Backbone Generation with Topology-aware Latent Encoding".

With the provided scripts and weights, you can sample protein backbones of fixed length and reproduce the evaluation experiments in Section 2 of the Results in the paper. We also provide the designability prediction layer used in Section 3.1.

We apologize that, because the work is still under active development, we cannot yet provide the full training and evaluation code. We expect to release a more complete version by the end of November.

## Installation


We recommend using conda to install the dependencies. Because we use the invariant point attention (IPA) implementation from OpenFold, this repo includes a forked version of OpenFold. By default, OpenFold computes IPA attention with a custom memory-efficient kernel, so to reproduce the results faithfully we recommend also installing OpenFold from our forked directory. In theory, skipping this step and using the vanilla attention implementation should not affect the results.

```bash
# create the conda environment
conda env create -n topodiff -f env.yml
conda activate topodiff

# download the weights
cd weight
wget https://anonymous.4open.science/api/repo/workshop_weight-3D12/file/neurips_workshop.pt
cd ..

# (Optional but recommended) install OpenFold from our forked directory
# make sure g++ (>=6.0.0, <12.0) is on your PATH
cd myopenFold
python3 setup.py install
cd ..
```
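Before building the forked OpenFold, it can help to confirm that the compiler falls inside the range given in the comment above. The following is a minimal sketch, not part of this repo: the helper names `parse_version` and `gxx_version_supported` are hypothetical, and the check assumes that `g++ -dumpversion` prints a plain version number such as `11` or `11.4.0`.

```python
import re
import shutil
import subprocess


def parse_version(text):
    """Extract the first dotted version number in text as a 3-tuple of ints."""
    match = re.search(r"\d+(?:\.\d+){0,2}", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (3 - len(parts))  # pad "11" to (11, 0, 0)
    return tuple(parts)


def gxx_version_supported(text):
    """Return True if the version lies in the stated range: >=6.0.0, <12.0."""
    return (6, 0, 0) <= parse_version(text) < (12, 0, 0)


if __name__ == "__main__":
    gxx = shutil.which("g++")
    if gxx is None:
        print("g++ not found on PATH")
    else:
        # -dumpversion prints only the compiler version number
        result = subprocess.run([gxx, "-dumpversion"], capture_output=True, text=True)
        version = result.stdout.strip()
        try:
            ok = gxx_version_supported(version)
        except ValueError:
            print("could not parse g++ version:", repr(version))
        else:
            print(version, "is supported" if ok else "is outside the stated range")
```

If the reported version is outside the range, installing a different g++ into the conda environment before running `setup.py` is one option.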

## Usage

### Sampling

```
python run_sampling.py [-h] [-o OUTDIR] [-s START] [-e END] [-i INTERVAL] [-n NUM_SAMPLES] [--min_sc MIN_SC] [--max_sc MAX_SC] [--seed SEED] [--gpu GPU] [--num_k NUM_K] [--epsilon EPSILON]

# e.g.
# python run_sampling.py -o sampling_result -s 100 -e 100 -n 10
```

Arguments:

    -h, --help            show this help message and exit
    -o OUTDIR, --outdir OUTDIR
                        The output directory
    -s START, --start START
                        The start length of sampling, must be larger than 50, default: 100
    -e END, --end END     
                        The end length of sampling (inclusive), must be smaller than 250, default: 100
    -i INTERVAL, --interval INTERVAL
                        The interval of sampling length, default: 10
    -n NUM_SAMPLES, --num_samples NUM_SAMPLES
                        The number of samples to generate for each length, default: 5
    --min_sc MIN_SC       The minimum predicted designability score of the latent, default: 0.0
    --max_sc MAX_SC       The maximum predicted designability score of the latent, default: 10.0
    --seed SEED           The random seed for sampling, default: None
    --gpu GPU             The gpu id for sampling, default: None
    --num_k NUM_K         The number of k to decide the expected length of the latent, default: 1
    --epsilon EPSILON     The range of variation of the expected length of the latent, default: 0.2
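For scripted or batch runs, the command line can be assembled programmatically. The sketch below is illustrative only: `build_sampling_command` is a hypothetical helper, not part of this repo. It uses only flags listed above, reads the length bounds literally from the help text (start > 50, end < 250), and adds an `end >= start` check as an extra assumption.

```python
def build_sampling_command(outdir, start=100, end=100, interval=10,
                           num_samples=5, seed=None, gpu=None):
    """Assemble an argument list for run_sampling.py from the documented flags."""
    # Bounds taken literally from the help text; their strictness is an assumption.
    if start <= 50:
        raise ValueError("start must be larger than 50")
    if end >= 250:
        raise ValueError("end must be smaller than 250")
    # Extra sanity check, not stated in the help text.
    if end < start:
        raise ValueError("end must not be smaller than start")

    cmd = [
        "python", "run_sampling.py",
        "-o", str(outdir),
        "-s", str(start),
        "-e", str(end),
        "-i", str(interval),
        "-n", str(num_samples),
    ]
    # Optional flags are only added when a value is given.
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    return cmd


cmd = build_sampling_command("sampling_result", start=100, end=100, num_samples=10)
print(" ".join(cmd))
# prints: python run_sampling.py -o sampling_result -s 100 -e 100 -i 10 -n 10
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repo root.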

The output directory will be arranged as follows:

```
outdir
├── length_100
│   ├── sample_0.pdb
│   ├── sample_1.pdb...
├── length_110
│   ├── sample_0.pdb
│   ├── sample_1.pdb...
...
```
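To check that a run finished, the expected files can be enumerated from the same arguments. This is a minimal sketch under two assumptions: the sampled lengths are `start`, `start + interval`, and so on up to and including `end`, and files follow the `length_<L>/sample_<i>.pdb` layout shown above with zero-based sample indices. The function names are hypothetical and not part of this repo.

```python
from pathlib import Path


def sampled_lengths(start=100, end=100, interval=10):
    """Lengths assumed to be sampled: start, start + interval, ..., up to end (inclusive)."""
    return list(range(start, end + 1, interval))


def expected_outputs(outdir, start=100, end=100, interval=10, num_samples=5):
    """Paths expected under outdir, following the directory layout shown above."""
    return [
        Path(outdir) / f"length_{length}" / f"sample_{i}.pdb"
        for length in sampled_lengths(start, end, interval)
        for i in range(num_samples)
    ]


def missing_outputs(outdir, **kwargs):
    """Expected paths that do not exist on disk."""
    return [p for p in expected_outputs(outdir, **kwargs) if not p.is_file()]


if __name__ == "__main__":
    # Matches the example command: -o sampling_result -s 100 -e 100 -n 10
    missing = missing_outputs("sampling_result", start=100, end=100, num_samples=10)
    print(f"{len(missing)} expected file(s) missing")
```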

## Acknowledgements

We adapted some code from [OpenFold](https://github.com/aqlaboratory/openfold), [FrameDiff](https://github.com/jasonkyuyim/se3_diffusion), [diffae](https://github.com/phizaz/diffae) and [progres](https://github.com/jgreener64/progres). We thank the authors for their impressive work.

1. Ahdritz, G., Bouatta, N., Kadyan, S., Xia, Q., Gerecke, W., O’Donnell, T. J., ... & AlQuraishi, M. (2022). OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv, 2022-11.
2. Yim, J., Trippe, B. L., De Bortoli, V., Mathieu, E., Doucet, A., Barzilay, R., & Jaakkola, T. (2023). SE(3) diffusion model with application to protein backbone generation. arXiv preprint arXiv:2302.02277.
3. Preechakul, K., Chatthee, N., Wizadwongsa, S., & Suwajanakorn, S. (2022). Diffusion autoencoders: Toward a meaningful and decodable representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10619-10629).
4. Greener, J. G., & Jamali, K. (2022). Fast protein structure searching using structure graph embeddings. bioRxiv, 2022-11.


