# ProtEx

This repository contains anonymized code related to the paper "ProtEx: A Retrieval-Augmented Approach for Protein Function Prediction".

## Installation

We recommend setting up a virtual environment. We provide an example
using `conda`:

```shell
conda create -n protex python=3.10
conda activate protex
```

Then install dependencies specified in `setup.py`:

```shell
pip install .
```

## Overview

The code, along with the released model predictions, supports reproducing the main results from the paper.

The code is organized as follows:

* `blast/` - Contains conversion scripts for reproducing BLAST results.
* `common/` - Some common utility libraries.
* `data/` - Contains conversion scripts for various datasets to a common format.
* `eval/` - Contains tools for computing various evaluation metrics.

We convert datasets to a common format consisting of newline-separated JSON files, where each line is a JSON object with the following keys:

*   `sequence` - String of protein sequence.
*   `accession` - String for unique identifier, e.g. UniProt accession.
*   `labels` - List of strings for labels, e.g. EC numbers.
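As an illustration of this format, here is a minimal sketch of writing and reading such a file with the Python standard library. The helper names `write_jsonl` and `read_jsonl` are hypothetical and not part of this repository, and the example record is made up.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Write one JSON object per line (hypothetical helper)."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read records back, skipping blank lines (hypothetical helper)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative record only: the sequence, accession and label are invented.
example_records = [
    {"sequence": "MKTAYIAKQR", "accession": "P00000", "labels": ["EC:1.1.1.1"]},
]

# Round trip through a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, "example.jsonl")
    write_jsonl(example_path, example_records)
    loaded_records = read_jsonl(example_path)
```

Keeping one record per line means large datasets can be streamed line by line without loading the whole file into memory.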

## Usage Examples

### ProteInfer

Here we provide a usage example that reproduces the results for the clustered EC split of the ProteInfer dataset. Conversion and evaluation scripts for other datasets
can be found in `data/` and `eval/`, and their usage is similar.

The [original dataset](https://google-research.github.io/proteinfer/) is available on GCP at `gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/`. We can set our input to the path of the clustered EC test split:

```shell
CLUSTERED_EC_TEST_TFR="gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/clustered/test.tfrecord"
```

We will assume that the variable `DATA_DIR` is set to a readable and writable
directory, such as `DATA_DIR=/tmp/`.

We can then run the data conversion script:

```shell
CLUSTERED_EC_TEST_JSONL="${DATA_DIR}/proteinfer_clustered_ec_test.jsonl"
python -m data.convert_proteinfer \
--alsologtostderr \
--input=${CLUSTERED_EC_TEST_TFR} \
--output=${CLUSTERED_EC_TEST_JSONL} \
--labels=ec
```

Model predictions for ProtEx on all test splits are available in the `predictions` sub-folder. Specifically, the clustered EC predictions are here:

```shell
PREDS_PROTEX=predictions/proteinfer-clustered-ec-test-jsonl
```

We can then reproduce the max micro-averaged F1 metrics reported for this split with the following script:

```shell
python -m eval.eval_micro_f1 \
--alsologtostderr \
--dataset=${CLUSTERED_EC_TEST_JSONL} \
--predictions=${PREDS_PROTEX}
```

We also released BLAST predictions, so the script above can reproduce the reported BLAST results when given the following path as the `--predictions` argument:

```shell
PREDS_BLAST=predictions/proteinfer-clustered-ec-test-jsonl
```
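For intuition about the metric, the following is a minimal sketch of a max micro-averaged F1 computation: micro-averaged F1 is evaluated at every candidate score threshold and the maximum is reported. It is not the implementation in `eval/`. The in-memory layout (gold labels as `{accession: set of labels}` and predictions as `{accession: {label: score}}`) and the function names are assumptions made for illustration.

```python
def micro_f1_at_threshold(gold, scores, threshold):
    """Micro-averaged F1 when keeping labels with score >= threshold.

    gold:   {accession: set of gold labels}        (assumed layout)
    scores: {accession: {label: predicted score}}  (assumed layout)
    """
    tp = fp = fn = 0
    for accession, gold_labels in gold.items():
        predicted = {
            label
            for label, score in scores.get(accession, {}).items()
            if score >= threshold
        }
        # Counts are pooled over all proteins and labels (micro-averaging).
        tp += len(predicted & gold_labels)
        fp += len(predicted - gold_labels)
        fn += len(gold_labels - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def max_micro_f1(gold, scores):
    """Maximum micro-averaged F1 over all thresholds seen in the scores."""
    thresholds = sorted(
        {score for per_label in scores.values() for score in per_label.values()}
    )
    return max(
        (micro_f1_at_threshold(gold, scores, t) for t in thresholds),
        default=0.0,
    )


# Toy inputs; accessions, labels and scores are invented.
toy_gold = {"A": {"x", "y"}, "B": {"z"}}
toy_scores = {"A": {"x": 0.9, "y": 0.4, "w": 0.5}, "B": {"z": 0.8}}
toy_result = max_micro_f1(toy_gold, toy_scores)
```

Only thresholds equal to an observed score need to be checked, because the set of predicted labels changes only at those values.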

#### Reproducing BLAST

We also released code to reproduce the BLAST predictions. For this we need to also convert the ProteInfer training set:

```shell
CLUSTERED_EC_TRAIN_TFR="gs://brain-genomics-public/research/proteins/proteinfer/datasets/swissprot/clustered/train.tfrecord"
CLUSTERED_EC_TRAIN_JSONL="${DATA_DIR}/proteinfer_clustered_ec_train.jsonl"
python -m data.convert_proteinfer \
--alsologtostderr \
--input=${CLUSTERED_EC_TRAIN_TFR} \
--output=${CLUSTERED_EC_TRAIN_JSONL} \
--labels=ec
```

We then need to convert both train and test splits to `.fasta` format:

```shell
CLUSTERED_EC_TRAIN_FASTA="${DATA_DIR}/proteinfer_clustered_ec_train.fasta"
python -m blast.convert_to_fasta \
--alsologtostderr \
--input=${CLUSTERED_EC_TRAIN_JSONL} \
--output=${CLUSTERED_EC_TRAIN_FASTA}

CLUSTERED_EC_TEST_FASTA="${DATA_DIR}/proteinfer_clustered_ec_test.fasta"
python -m blast.convert_to_fasta \
--alsologtostderr \
--input=${CLUSTERED_EC_TEST_JSONL} \
--output=${CLUSTERED_EC_TEST_FASTA}
```
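To illustrate what such a conversion involves, here is a minimal sketch that renders records in the common format as FASTA text. It assumes the accession is used as the FASTA header and that sequences are wrapped at a fixed width. The function `to_fasta` is hypothetical, and the actual behavior of `blast.convert_to_fasta` may differ.

```python
def to_fasta(records, width=60):
    """Render records as FASTA text (hypothetical helper).

    Each record is a dict with at least `accession` and `sequence` keys.
    The accession becomes the header line (an assumption), and the
    sequence is wrapped to `width` characters per line.
    """
    lines = []
    for record in records:
        lines.append(">" + record["accession"])
        sequence = record["sequence"]
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Toy record; the accession and sequence are invented.
toy_fasta = to_fasta(
    [{"accession": "P1", "sequence": "ABCDEFG", "labels": []}], width=3
)
```

Using the accession as the header lets BLAST hits be mapped back to the original records, since BLAST reports sequence identifiers taken from the header line.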

Note that if `DATA_DIR` refers to a GCP bucket rather than a local directory, the files may need to be copied locally before proceeding to the next step, so that the BLAST command-line tool can read them. We will assume `BLAST_DIR` is set to the location of the BLAST binaries,
e.g. `BLAST_DIR=".../ncbi-blast-2.14.1+/bin"`.

We can then build a BLAST database from the training split and query it with the test split:

```shell
BLAST_TSV="${DATA_DIR}/blast_proteinfer_clustered_ec_test.tsv"
${BLAST_DIR}/makeblastdb -in ${CLUSTERED_EC_TRAIN_FASTA} -dbtype prot
${BLAST_DIR}/blastp -query ${CLUSTERED_EC_TEST_FASTA} -db ${CLUSTERED_EC_TRAIN_FASTA} -outfmt 6 -max_hsps 1 -num_threads 16 -max_target_seqs 1 -out ${BLAST_TSV}
```

Finally, we can convert the tsv file generated by BLAST to the standard predictions format we are using:

```shell
BLAST_JSONL="${DATA_DIR}/blast_proteinfer_clustered_ec_test.jsonl"
python -m blast.convert_blast \
--alsologtostderr \
--input=${BLAST_TSV} \
--database_records=${CLUSTERED_EC_TRAIN_FASTA} \
--output=${BLAST_JSONL}
```
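The conversion step can be pictured with the following minimal sketch, which parses BLAST tabular output (`-outfmt 6`, with its default twelve columns) and transfers the labels of the best hit to each query. Using the bit score as the prediction score, the output layout `{query: {label: score}}`, and the function names are assumptions for illustration. This is not the logic of `blast.convert_blast`.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def top_hits(tsv_lines):
    """Return {query id: (subject id, bit score)} for the best hit per query."""
    best = {}
    for row in csv.reader(tsv_lines, delimiter="\t"):
        if not row:
            continue
        hit = dict(zip(BLAST_COLUMNS, row))
        score = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or score > best[query][1]:
            best[query] = (hit["sseqid"], score)
    return best


def transfer_labels(best, train_labels):
    """Assign each query the labels of its best hit (assumed scoring scheme).

    train_labels: {training accession: list of labels}
    """
    return {
        query: {label: score for label in train_labels.get(subject, [])}
        for query, (subject, score) in best.items()
    }


# Toy BLAST rows; identifiers, scores and the label are invented.
toy_rows = [
    "q1\ts1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "q1\ts2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250",
]
toy_best = top_hits(toy_rows)
toy_predictions = transfer_labels(toy_best, {"s2": ["EC:1.1.1.1"]})
```

Queries without any BLAST hit do not appear in the output, which corresponds to predicting no labels for them.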

## License and disclaimer

Copyright 2024 The ProtEx Authors. All rights reserved.

All software is licensed under the Apache License, Version 2.0 (Apache 2.0);
you may not use this file except in compliance with the Apache 2.0 license.
You may obtain a copy of the Apache 2.0 license at:
https://www.apache.org/licenses/LICENSE-2.0

All other materials are licensed under the Creative Commons Attribution 4.0
International License (CC-BY). You may obtain a copy of the CC-BY license at:
https://creativecommons.org/licenses/by/4.0/legalcode

Unless required by applicable law or agreed to in writing, all software and
materials distributed here under the Apache 2.0 or CC-BY licenses are
distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND,
either express or implied. See the licenses for the specific language governing
permissions and limitations under those licenses.

