## PPI CANDIDATE RANKING: LARGE-SCALE EVALUATION OF A DOMAIN KNOWLEDGE-GUIDED PIPELINE
Computational approaches have become central to Protein–Protein Interaction (PPI) research, but experimental validation of predicted interactions remains costly and incomplete. To address this challenge, we introduce the problem of PPI candidate ranking: given a target protein and its known partners, the task is to prioritize novel candidates most likely to interact with it.

Our framework leverages the interpretability of deep sequence-based models (D-SCRIPT and Topsy-Turvy) to guide retrieval, exploiting internal representations of known interactions to rank new ones. We further refine these rankings by integrating complementary signals: D-SCRIPT interaction scores, pDockQ structural scores, language-model re-rankers, and annotation-based semantic heuristics.

## Repository Structure
The repository is organized into four main folders and one root configuration file:
```
ppi_candidate_ranking/
├── candidate_retrieval/   # Retrieval pipeline
├── data_preprocessing/    # Scripts for preparing STRING datasets (v11, v12, FASTA filtering)
├── evaluation/            # Evaluation scripts for rediscovery, recommendation, and metrics
├── re_ranking/            # Pipeline for re-ranking the top-k candidates
├── config_base.py         # Global configuration (paths, parameters)
├── README.md              # Documentation
└── requirements.txt       # Required libraries
```

## How to Execute Code
### Step 0. Install Dependencies
We recommend using Python 3.9 or later. All dependencies are listed in `requirements.txt`.

To install them, run:

```bash
pip install -r requirements.txt
```

### Step 1. Download Data
All data is taken from the official STRING database, specifically:
- STRING v11 (https://version-11-0.string-db.org/cgi/download.pl)
- STRING v12 (https://string-db.org/cgi/download)

We use four kinds of files: general protein information, alias mappings, the full interaction network, and FASTA sequences. Before running any code, download these files and set the corresponding path variables in `config_base.py` and `data_preprocessing/config.py`.

The variables to set are:
- V11:
    - ALIAS_V11, with "9606.protein.info.v11.0.txt"
    - v11_fasta, with "9606.protein.sequences.v11.0.fa"

- V12:
    - ALIAS_V12, with "9606.protein.info.v12.0.txt"
    - v12_fasta, with "9606.protein.sequences.v12.0.fa"
    - STRING_FILE, with "9606.protein.physical.links.detailed.v12.0.txt"
    - CDHIT_CLUSTER_FILE, with "clustered_9606.clstr"

Note: `V11_FILE` is the interaction dataset used to train D-SCRIPT. It is available from the official repository (https://github.com/samsledje/D-SCRIPT).
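
As a sanity check before running the pipeline, the sketch below lists configuration variables whose path does not point to an existing file. The `data/...` directory layout and the `missing_files` helper are hypothetical and not part of the repository; only the variable names and file names come from the list above.

```python
from pathlib import Path

# Hypothetical locations: replace each value with the path of your own download.
# V11_FILE can be added in the same way once it has been downloaded.
CONFIG = {
    "ALIAS_V11": "data/v11/9606.protein.info.v11.0.txt",
    "v11_fasta": "data/v11/9606.protein.sequences.v11.0.fa",
    "ALIAS_V12": "data/v12/9606.protein.info.v12.0.txt",
    "v12_fasta": "data/v12/9606.protein.sequences.v12.0.fa",
    "STRING_FILE": "data/v12/9606.protein.physical.links.detailed.v12.0.txt",
    "CDHIT_CLUSTER_FILE": "data/v12/clustered_9606.clstr",
}


def missing_files(config):
    """Return the names of variables whose path is not an existing file."""
    return [name for name, path in config.items() if not Path(path).is_file()]


if __name__ == "__main__":
    # Print one line per variable that still needs a valid path.
    for name in missing_files(CONFIG):
        print(f"{name}: file not found at {CONFIG[name]}")
```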

### Step 2. Get the Models
This work relies on pre-trained models and a sequence embedding backbone. Download the files listed below from the official D-SCRIPT repository and set the corresponding global variables in the configuration file of the `candidate_retrieval` subfolder.
- DSCRIPT_EMBED, with the path to dscript/commands/embed.py 
- DSCRIPT_INTERACTION, with the path to dscript/commands/predict_interaction.py
- DSCRIPT_MODEL, with human_v1.sav
- TOPSY_MODEL, with topsy_turvy_model.sav

### Step 3. Preprocess Data
At this point, the v12 interaction dataset is constructed and aligned with v11. We preprocess it to obtain a clean set of novel interactions, ensure that only proteins with valid FASTA sequences are retained, and generate embeddings for downstream retrieval.

The preprocessing involves the following scripts:

- `prepare_string_dataset.py`: constructs the processed STRING v12 dataset, normalizing aliases, links, and sequences.

- `get_new_pairs.py`: identifies pairs that appear in v12 but not in v11, defining the novel interaction set.

- `filter_fasta.py`: filters out proteins without valid sequences and, in v11, retains only pairs in which both proteins have a valid FASTA entry.

- Embedding generation: produces embeddings for all valid proteins using the official D-SCRIPT command (run once with D-SCRIPT as the main model and once with Topsy-Turvy). The commands are documented at https://d-script.readthedocs.io/en/main/usage.html. Example:
    ```bash
    dscript embed --seqs [sequences.fa] --outfile [embeddings.h5]
    ```
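
The logic behind the novel-pair and FASTA-filtering steps can be sketched as follows. This is a minimal illustration, not the code in `get_new_pairs.py` or `filter_fasta.py`: it assumes that interactions are unordered pairs and that a "valid" FASTA entry is simply a record with a non-empty sequence. The function names are hypothetical.

```python
def read_fasta_ids(lines):
    """Collect the IDs of FASTA records that have a non-empty sequence."""
    valid, current, has_seq = set(), None, False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Close the previous record before starting a new one.
            if current is not None and has_seq:
                valid.add(current)
            header = line[1:].split()
            current, has_seq = (header[0] if header else None), False
        elif line and current is not None:
            has_seq = True
    if current is not None and has_seq:
        valid.add(current)
    return valid


def canonical(a, b):
    """Order a pair so that (a, b) and (b, a) compare equal."""
    return (a, b) if a <= b else (b, a)


def novel_pairs(v11_pairs, v12_pairs, valid_ids):
    """Pairs present in v12 but not in v11, with both proteins valid."""
    old = {canonical(a, b) for a, b in v11_pairs}
    new = {canonical(a, b) for a, b in v12_pairs}
    return sorted(
        p for p in new - old if p[0] in valid_ids and p[1] in valid_ids
    )
```

In practice, `read_fasta_ids` would be called on an open FASTA file, and the two pair lists would be read from the v11 and v12 interaction files.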

### Step 4. Candidate Retrieval
Once the datasets are prepared, we move to the retrieval stage. Here, the goal is to rediscover known partners and recover novel interactions by leveraging embeddings and interpretability signals. This phase includes preparing the v11 baseline set, extracting known partners, running similarity-based retrieval, and evaluating rediscovery/recommendation performance.
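
To make the idea of similarity-based retrieval concrete, here is a minimal sketch that ranks candidates by their mean cosine similarity to the known partners of a target. It is an illustration under stated assumptions, not the method in `main.py`: it assumes each protein is represented by one fixed-length vector (for example, a pooled embedding), it omits the interpretability guidance entirely, and `rank_candidates` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_candidates(embeddings, known_partners, exclude=()):
    """Rank proteins by mean cosine similarity to the known partners.

    embeddings: dict mapping protein ID -> vector (list of floats).
    known_partners: IDs of proteins already known to interact with the target.
    exclude: additional IDs to leave out (for example, the target itself).
    """
    partners = [p for p in known_partners if p in embeddings]
    if not partners:
        return []
    skip = set(partners) | set(exclude)
    scores = {}
    for cand, vec in embeddings.items():
        if cand in skip:
            continue
        sims = [cosine(vec, embeddings[p]) for p in partners]
        scores[cand] = sum(sims) / len(sims)
    # Highest similarity first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```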

The retrieval involves the following scripts:
- `main.py`: performs similarity-based retrieval using protein embeddings and interpretability guidance.

- `evaluate_rediscovery.py`: evaluates how well novel v12 interactions are rediscovered in the retrieval stage.

- `evaluate_recommendation.py`: assesses the quality of global rankings and recommendation lists.

- `compute_metrics.py`: computes aggregate evaluation metrics across rediscovery and recommendation tasks.
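
As an example of the kind of metric used for rediscovery, the sketch below computes recall@k: the fraction of held-out partners that appear among the top-k retrieved candidates. Recall@k is a common choice for this setting, but it is an assumption here; the metrics actually reported by `compute_metrics.py` may differ, and the function names are hypothetical.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items found among the first k ranked items."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_recall_at_k(rankings, truth, k):
    """Average recall@k over targets with at least one held-out partner.

    rankings: dict mapping target ID -> ranked list of candidate IDs.
    truth: dict mapping target ID -> set of novel (held-out) partner IDs.
    """
    values = [
        recall_at_k(rankings.get(target, []), partners, k)
        for target, partners in truth.items()
        if partners
    ]
    return sum(values) / len(values) if values else 0.0
```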

### Step 5. Re-Ranking
After the initial retrieval, we refine candidate lists by applying multiple re-ranking strategies. The goal is to improve prioritization of plausible interaction partners by combining structural, semantic, and language-model signals.

The re-ranking involves the following components:

- `generate_topk_expansion.py` and `generate_topk_pairs.py`: for each protein, retain only the top-10 candidate partners.

- D-SCRIPT IS: compute interaction scores (IS) for candidate pairs using the original D-SCRIPT model, with the command from the official documentation:
    ```bash
    dscript predict --pairs [list of pairs] --embeddings [embedding file] --outfile [outfile] --model [model file]
    ```

- pDockQ (SpeedPPI folder): structural scoring of candidate pairs using the SpeedPPI implementation of pDockQ. Run it with the `run_speedppi.sh` script.

- LLMs: large language models used to re-rank candidates based on textual/functional knowledge. We have:
    - `llm_re_ranking.py` and `llm_prompt_re_ranking.py` for BioBERT and BioMedRoBERTa
    - `test_ce_ppi.py`, for PubMedBERT

- Semantic: lightweight heuristics based on functional annotations and free-text summaries, implemented in `rerank_with_annotations.py`.

- `integrate_and_rank.py`: integrates all re-ranking signals into a unified ranked list and evaluates improvements over the baseline retrieval.
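
One simple way to combine several re-ranking signals is to min-max normalize each signal and take a weighted average. The sketch below illustrates that idea only; the fusion rule used by `integrate_and_rank.py` is not described here and may be different. The function names, the signal names, and the weights are hypothetical, and every signal is assumed to be "higher is better".

```python
def min_max(scores):
    """Rescale a dict of scores to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(signals, weights=None):
    """Combine signals into one ranked list of (candidate, fused score).

    signals: dict mapping signal name -> {candidate ID -> raw score}.
    weights: optional dict mapping signal name -> weight (default: equal).
    A candidate missing from a signal contributes 0 for that signal.
    """
    weights = weights or {name: 1.0 for name in signals}
    total = sum(weights[name] for name in signals)
    fused = {}
    for name, scores in signals.items():
        if not scores:
            continue
        for cand, value in min_max(scores).items():
            fused[cand] = fused.get(cand, 0.0) + weights[name] * value / total
    # Highest fused score first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

For example, `fuse({"dscript_is": is_scores, "pdockq": pdockq_scores})` would return the candidates of one target ordered by the equal-weight average of the two normalized scores.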

Note: The code is modular and can be run step by step once the datasets are downloaded.