# Potts_Inverse_Folding
Repository reproducing part of the code for the paper "Fast uncovering of protein sequence diversity from structure".
## Main packages required and versions used
Find all the packages and their versions in the file **libraries.txt**.
## Dataset creation
To just test the pretrained models, this step is not necessary: we provide a couple of in-sample and out-of-sample structures to experiment with in the **Data_Subset** folder. This section requires roughly $120$ GB of memory, while the execution time depends on the CPU resources available to [MMseqs2](https://github.com/soedinglab/MMseqs2). We ran it on an _Intel i9-13900K/KF 5.8 GHz_, and this step took roughly a day to complete. The code can be found in the folder *data_creation*. The requirements to run the code are:
- Download the _Uniref50_ dataset which can be found on the [UniProt website](https://www.uniprot.org/help/downloads).
- Download the [CATH](http://download.cathdb.info/cath/releases/latest-release/non-redundant-data-sets/) 4.2 40% non-redundant dataset.
- Install the [MMseqs2](https://github.com/soedinglab/MMseqs2) library and the [esm](https://github.com/facebookresearch/esm) repository.

Once this is done, and the necessary datasets and repositories are placed in paths compatible with the following files, one has to:
1. Run the bash file _create_msas.sh_ to create the MSAs for the different structures.
2. Run the Python script _train_test_split.py_ to split the MSAs into the train/test split outlined in the manuscript.
3. Run the Python notebook _get_numerical_MSA.ipynb_ to convert all the MSAs to numerical format, so that they are ready to be used by the model and do not have to be converted at every update step of the training.
4. Run the Python notebook _get_encodings.ipynb_ to get the __ESM-IF1__ pretrained encodings to feed to one of our Potts decoders.
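
As an illustration of what "numerical format" means in step 3, the sketch below maps aligned sequences to an integer array. It is not the code in _get_numerical_MSA.ipynb_: the alphabet ordering, the handling of unknown symbols, and the function name `encode_msa` are all assumptions made for this example.

```python
import numpy as np

# Assumed alphabet: the 20 standard amino acids plus the gap symbol as the last state.
# The ordering actually used by the notebook may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
AA_TO_INT = {aa: i for i, aa in enumerate(ALPHABET)}
GAP = AA_TO_INT["-"]


def encode_msa(sequences):
    """Map aligned sequences to an integer array of shape (n_seqs, length)."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences in an MSA must have the same length")
    # Assumption: unknown or non-standard symbols (e.g. X, B, Z) are treated as gaps.
    return np.array(
        [[AA_TO_INT.get(c, GAP) for c in seq.upper()] for seq in sequences],
        dtype=np.int8,
    )


msa = encode_msa(["ACD-", "AXDE"])
```

Storing the result once (for instance with `np.save`) avoids repeating the string-to-integer conversion at every training step.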

## Training models

In the training folder one can find two files, which train the standard pairwise Potts model and the autoregressive Potts model, respectively. If the steps detailed in the **Dataset creation** section have been performed properly, this code should run directly (provided one has all the necessary libraries reported in **libraries.txt**).

We trained our models on a single **NVIDIA GeForce RTX 3090** with $24$ GB of memory. Training time depends on the choice of GPU, but also on where the data is stored during training. In our case, we could not fit all the data in RAM, hence we loaded every batch during training from a local SSD. By default, the hyperparameters of the model are set to those reported in the manuscript, and training runs for $94$ epochs. The training of the standard Potts model took roughly $10$ hours, while the training of the autoregressive Potts model took about $24$ hours.
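
Loading batches from disk rather than holding everything in RAM can be sketched as follows. This is a minimal illustration, not the loader used in the training scripts: it assumes one `.npy` file per structure, and the name `iter_batches` is hypothetical.

```python
import numpy as np


def iter_batches(files, batch_size, rng=None):
    """Yield lists of arrays, reading each .npy file only when its batch is requested."""
    files = list(files)
    order = np.arange(len(files))
    if rng is not None:
        rng.shuffle(order)  # shuffle the file order for this pass over the data
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # Only the files of the current batch are resident in memory at once.
        yield [np.load(files[i]) for i in idx]
```

Calling `iter_batches(paths, batch_size=8, rng=np.random.default_rng(0))` once per epoch gives a fresh pass over the data; with this layout, disk read speed becomes part of the training time, which is why the storage medium matters.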

## Test
To run the functions/files in this section, one must either have trained a model from scratch or download the pretrained __InvMSAFold_PW__ and __InvMSAFold_AR__ models by running the notebook **load_models.ipynb**.

In this folder one can find the code to generate samples from the models considered in the manuscript. To do so, one has to download the [bmDCA](https://github.com/ranganathanlab/bmdca) library to efficiently generate MCMC samples from the __InvMSAFold_PW__ model. All the auxiliary folders needed to run the samplers should already be present. The [ESM](https://github.com/facebookresearch/esm) repository is already in the current folder.


By running **sample.py** one can generate the desired number of samples from the three different models. To disable some of the models and generate only from a subset of them, see the global variables defined in the model. Currently the samples of __ESM-IF1__ are automatically aligned; if this is not the desired behaviour, one can comment out the line that aligns the samples and store the unaligned samples instead. To compute covariances, use the function **compute_covariance** in the file **/util/test_utils.py**.
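
For readers unfamiliar with the quantity involved, the sketch below computes the standard connected pairwise covariance $C_{ij}(a,b) = f_{ij}(a,b) - f_i(a) f_j(b)$ from an integer-encoded MSA. It is not the repository's **compute_covariance**: the function name `pairwise_covariance`, the number of states `q=21`, and the absence of sequence reweighting or pseudocounts are assumptions made for this example.

```python
import numpy as np


def pairwise_covariance(msa, q=21):
    """Return C[i, a, j, b] = f_ij(a, b) - f_i(a) * f_j(b) for an integer MSA of shape (M, L)."""
    msa = np.asarray(msa)
    M, L = msa.shape
    one_hot = np.eye(q)[msa]          # shape (M, L, q)
    flat = one_hot.reshape(M, L * q)  # one row per sequence
    f1 = flat.mean(axis=0)            # single-site frequencies f_i(a)
    f2 = flat.T @ flat / M            # pairwise frequencies f_ij(a, b)
    return (f2 - np.outer(f1, f1)).reshape(L, q, L, q)
```

A common way to assess generated samples is to compare the entries of this tensor between the sampled sequences and the natural MSA, for example through their Pearson correlation.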

## Util folder

As the name suggests, this folder defines the functions that keep the code in all the other folders described above modular. The functions inside this folder should be well explained and commented.
