This repository provides the source code and data for the paper "Disconnecting The Dots: Creating Leakage-Free Protein Datasets by Sparse Removal of Densely Connected Data Points".

A refactored version of this code will be publicly released in the coming weeks.


# Dataset splits

The original DeepFRI splits and the novel data splits introduced in the paper can be found under ```/data```. 
The file names follow the convention ```deepfri_<classification>_<similarity metric>_<lowest threshold>-<highest threshold>_n_<total clusters for val and test>.csv```. 
Files with "stats" in their file names report the proportions of proteins in the different sets, with the "Train diff to 80" column indicating how close the training set is to the target 80% ratio (with 10% for validation and 10% for test).
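
As a rough illustration of the naming convention above, the following sketch parses a split file name into its components. The example file name, the `parse_split_name` helper, and the assumption that thresholds are written as plain numbers are all hypothetical; check the actual files under ```/data``` before relying on it.

```python
import re

# Pattern for the convention described above. The exact spelling of each
# field (e.g. how thresholds are written) is an assumption.
SPLIT_NAME = re.compile(
    r"^deepfri_(?P<classification>[^_]+)"
    r"_(?P<metric>.+)"
    r"_(?P<low>[0-9.]+)-(?P<high>[0-9.]+)"
    r"_n_(?P<n_clusters>\d+)\.csv$"
)


def parse_split_name(file_name):
    """Return the fields encoded in a split file name, or None if it does not match."""
    match = SPLIT_NAME.match(file_name)
    if match is None:
        return None
    return {
        "classification": match["classification"],
        "metric": match["metric"],
        "lowest_threshold": float(match["low"]),
        "highest_threshold": float(match["high"]),
        "n_clusters": int(match["n_clusters"]),
    }


# Hypothetical file name, used only to exercise the parser.
fields = parse_split_name("deepfri_mf_seqid_0.3-0.9_n_20.csv")
print(fields)
```

Names that do not follow the pattern return `None`, which may be useful for skipping unrelated files when iterating over the directory.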


# Creating new splits
As a preliminary step, install [MMseqs2](https://github.com/soedinglab/MMseqs2) and [Foldseek](https://github.com/steineggerlab/foldseek?tab=readme-ov-file#installation) locally.

### 1. Compute pairwise similarity metrics
The initial step involves calculating all-against-all pairwise similarities.

If your similarity metric is sequence identity, run ```scripts/run_mmseqs.sh```.
Otherwise, if the similarity metric is the TM score, run ```scripts/run_foldseek.sh```.

The parameters controlling MMseqs and Foldseek can be changed in the corresponding scripts.
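
To inspect the resulting similarities, a minimal sketch such as the one below may help. It assumes the output is a tab-separated file in the default MMseqs2 tabular layout (query, target, fraction of identical residues, then further alignment columns); the scripts in this repository may write a different format, in which case `score_column` needs adjusting. The `load_pairwise_scores` helper and the rows in `example` are made up for illustration.

```python
import csv
import io


def load_pairwise_scores(handle, score_column=2):
    """Read tab-separated hits into {(id_a, id_b): score}, ignoring self hits.

    Each pair is stored once with sorted IDs; if both directions are
    reported, the higher score is kept.
    """
    scores = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue
        query, target = row[0], row[1]
        if query == target:
            continue
        pair = tuple(sorted((query, target)))
        score = float(row[score_column])
        scores[pair] = max(score, scores.get(pair, 0.0))
    return scores


# Made-up rows in the assumed 12-column layout.
example = io.StringIO(
    "P1\tP1\t1.000\t200\t0\t0\t1\t200\t1\t200\t1e-99\t400\n"
    "P1\tP2\t0.850\t200\t30\t0\t1\t200\t1\t200\t1e-50\t300\n"
    "P2\tP1\t0.840\t200\t32\t0\t1\t200\t1\t200\t1e-49\t295\n"
    "P1\tP3\t0.310\t180\t124\t0\t1\t180\t5\t184\t1e-10\t90\n"
)
scores = load_pairwise_scores(example)
print(scores)
```

For a real file, pass an open file handle instead of the `io.StringIO` object.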

### 2. Running community-based clustering
Change the paths to the input directories according to your setup in ```dataset/run_community_detection.py```, then run ```python dataset/run_community_detection.py```.

The script should take a few minutes for a dataset of about 30,000 data points.
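
To give an intuition for what community-based clustering does, here is a toy sketch that thresholds pairwise similarities into a graph and groups nodes with a simplified, deterministic label-propagation scheme. This is a stand-in chosen for brevity, not the algorithm implemented in ```dataset/run_community_detection.py```; the function names, the threshold, and the toy scores are assumptions.

```python
from collections import Counter, defaultdict


def build_graph(scores, threshold):
    """Link two proteins when their similarity is at or above the threshold."""
    graph = {}
    for (a, b), score in scores.items():
        graph.setdefault(a, set())
        graph.setdefault(b, set())
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def label_propagation(graph, max_rounds=100):
    """Group nodes by repeatedly adopting the most common neighbour label.

    Nodes are visited in sorted order and ties are broken by keeping the
    current label if possible, else taking the smallest label, so the
    result is reproducible.
    """
    labels = {node: node for node in graph}
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):
            if not graph[node]:
                continue  # isolated node keeps its own label
            counts = Counter(labels[nb] for nb in graph[node])
            top = max(counts.values())
            candidates = sorted(lab for lab, c in counts.items() if c == top)
            if labels[node] not in candidates:
                labels[node] = candidates[0]
                changed = True
        if not changed:
            break
    communities = defaultdict(list)
    for node, label in labels.items():
        communities[label].append(node)
    return sorted(sorted(members) for members in communities.values())


# Toy similarities: two tight triangles joined by one weak pair (C, D),
# which falls below the threshold and is therefore not an edge.
toy_scores = {
    ("A", "B"): 0.9, ("A", "C"): 0.9, ("B", "C"): 0.9,
    ("D", "E"): 0.8, ("D", "F"): 0.8, ("E", "F"): 0.8,
    ("C", "D"): 0.2,
}
communities = label_propagation(build_graph(toy_scores, threshold=0.5))
print(communities)
```

On real data, a dedicated graph library is likely a better choice than this pure-Python loop.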


### 3. Creating data splits
Change the path to the input directories according to your setup in ```dataset/make_data_splits.py```. The variable ```n_out``` controls the total number of clusters obtained at each similarity threshold, depending on the desired train/val/test split ratios. We recommend trying several values of ```n_out```, as the proportion of the final splits cannot be known a priori. However, the script should only take a few minutes for a given value of ```n_out``` and a few input seeds.

After adapting the parameters to your case, run ```python dataset/make_data_splits.py```.
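
When comparing several values of ```n_out```, it can help to compute the split proportions and the distance of the training share from 80% for each run. The sketch below does this for made-up set sizes; the `split_stats` helper, the counts, and the reading of "Train diff to 80" as a signed difference in percentage points are assumptions, not the script's actual output.

```python
def split_stats(n_train, n_val, n_test):
    """Return split percentages and the signed gap of the train share to 80%."""
    total = n_train + n_val + n_test
    train_pct = 100 * n_train / total
    return {
        "train_pct": train_pct,
        "val_pct": 100 * n_val / total,
        "test_pct": 100 * n_test / total,
        "train_diff_to_80": train_pct - 80.0,
    }


# Hypothetical (train, val, test) sizes for three candidate n_out values.
hypothetical_counts = {
    10: (8600, 700, 700),
    20: (8000, 1000, 1000),
    30: (7400, 1300, 1300),
}

stats_by_n_out = {
    n_out: split_stats(*counts) for n_out, counts in hypothetical_counts.items()
}

# Pick the n_out whose training share is closest to the 80% target.
best_n_out = min(
    stats_by_n_out, key=lambda n: abs(stats_by_n_out[n]["train_diff_to_80"])
)
print(best_n_out, stats_by_n_out[best_n_out])
```

In practice, the counts would come from the split files produced for each value of ```n_out```.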


# Evaluating learned representations
The input representations need to be precomputed and saved in a specific folder. We provide a few example configuration files for training models, depending on the input representation and data split.

The models can be trained by running ```python ddots/train.py --config-name <config_file.yaml>```. 


