# Contextualizing biological perturbation experiments through language

All contents of this code distribution are copyrighted 2024, all rights reserved.

## Installation

The following packages are required to run our evaluation.

```
- scikit-learn
- numpy
- torchmetrics
- (optional, for BERTScore) transformers
```

We have included the PerturbationQA input + label pairs in this code distribution. You may download the additional materials, including knowledge graphs and gene summaries, in the [data distribution](https://zenodo.org/records/13760748?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImQ2NTU1MTZjLTQ1OTktNGFlZi1hNWE2LTk5ZDRhNzIwMGZjZSIsImRhdGEiOnt9LCJyYW5kb20iOiI4ZmQzNDZlNmZhZGQ1MTAzN2YyM2ZlYjU4ZWNjMGZmNCJ9.Ym9Ws841nq4_KDAFxXg4f7FC55jBCCedCEAyh5q44j3D5834pxUIU2mhZwYytQ2NJfb4kSe9re9gTqXJ68F_PA).

## PerturbQA benchmark

### Differential expression and direction of change

Datasets can be loaded as follows.

```
from pertqa import load_de, load_dir

# options: "k562" "rpe1" "hepg2" "jurkat" "k562_set"
data_de = load_de("k562")
# train/test splits
X_train = data_de["train"]
X_test = data_de["test"]

data_dir = load_dir("k562")
```

To evaluate your predictions:

```
from pertqa import auc_per_gene

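# one (perturbation, gene) key per test entry
keys = [(x["pert"], x["gene"]) for x in X_test]
pred = []  # your predicted scores (list / numpy array of floats), one per entry of X_test
true = [x["label"] for x in X_test]  # labels from load_de / load_dir
auc = auc_per_gene(keys, pred, true)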
```
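For intuition, a per-gene AUC can be sketched in plain Python as below. This is an illustrative sketch only, not the `pertqa` implementation: the helper names `binary_auc` and `mean_auc_by_gene` are hypothetical, labels are assumed to be binary (0/1), and genes whose labels contain only one class are assumed to be skipped. `auc_per_gene` may group, filter, or average differently.

```python
from collections import defaultdict


def binary_auc(scores, labels):
    """ROC AUC as the probability that a random positive outscores a random
    negative (ties count as 0.5). Returns None if only one class is present."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:  # quadratic pairwise comparison; fine for a toy example
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc_by_gene(keys, pred, true):
    """Group predictions by the gene in each (pert, gene) key, compute one AUC
    per gene, and average over genes that have both classes (assumption)."""
    groups = defaultdict(lambda: ([], []))
    for (_pert, gene), p, t in zip(keys, pred, true):
        groups[gene][0].append(p)
        groups[gene][1].append(t)
    aucs = [binary_auc(s, y) for s, y in groups.values()]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs) if aucs else float("nan")


# Toy data (made up for illustration, not from PerturbQA)
keys = [("P1", "G1"), ("P2", "G1"), ("P3", "G1"), ("P1", "G2"), ("P2", "G2")]
pred = [0.9, 0.2, 0.4, 0.3, 0.8]
true = [1, 0, 0, 1, 0]
toy_auc = mean_auc_by_gene(keys, pred, true)
```

In this toy example the single positive for `G1` is ranked above both negatives, while the positive for `G2` is ranked below its negative, so the two per-gene AUCs average out.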

### Gene set enrichment

Set the `skip_empty` flag to skip entries without manual annotations
(defaults to `True`).

```
from pertqa import load_gse

# options: "pert" "gene"
data = load_gse("pert", skip_empty=True)
```

To evaluate your predictions:

```
from pertqa import rouge1_recall

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
score = rouge1_recall(pred, true)
```
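As a rough illustration of what ROUGE-1 recall measures, the sketch below computes the fraction of reference unigrams that also appear in the prediction. It is not the `pertqa` implementation: the function names are hypothetical, and the lowercasing, whitespace tokenization, and simple averaging are assumptions, so `rouge1_recall` may return different values on real text.

```python
from collections import Counter


def unigram_recall(prediction, reference):
    """Fraction of reference tokens (with multiplicity) found in the prediction."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0  # assumption: an empty reference scores zero
    overlap = sum(min(count, pred_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total


def mean_unigram_recall(preds, refs):
    """Average the per-example recall over aligned prediction/label lists."""
    scores = [unigram_recall(p, r) for p, r in zip(preds, refs)]
    return sum(scores) / len(scores) if scores else 0.0


toy_score = mean_unigram_recall(["hello world"], ["hello"])
```

Because recall is computed relative to the label, extra words in the prediction are not penalized, whereas words missing from the prediction lower the score.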

The `transformers` library is required to compute BERTScore,
and we recommend having access to a GPU.

```
from pertqa import bert_score

pred = ["hello world"]  # list of predictions
true = ["hello"]  # list of labels, e.g. from load_gse
scores = bert_score(pred, true)
```

### Knowledge graphs

Knowledge graphs and gene summaries are available from our [data
distribution](https://zenodo.org/records/13760748?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImQ2NTU1MTZjLTQ1OTktNGFlZi1hNWE2LTk5ZDRhNzIwMGZjZSIsImRhdGEiOnt9LCJyYW5kb20iOiI4ZmQzNDZlNmZhZGQ1MTAzN2YyM2ZlYjU4ZWNjMGZmNCJ9.Ym9Ws841nq4_KDAFxXg4f7FC55jBCCedCEAyh5q44j3D5834pxUIU2mhZwYytQ2NJfb4kSe9re9gTqXJ68F_PA) under the archives
- `kg.zip`
- `gene_summary.zip`

Please place `kg` under `data/kg` if you wish to run `baselines/gene_set.py`.
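One way to unpack a downloaded archive is with Python's standard `zipfile` module, sketched below. The helper name `extract_archive` and the example paths are assumptions, and the internal layout of `kg.zip` is not described here: if the archive already contains a top-level `kg` folder, extracting into `data` rather than `data/kg` may be the right choice.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir (created if missing)
    and return the sorted relative paths of the files now under dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


# Example call (paths are assumptions; adjust to where you saved the download):
# extract_archive("kg.zip", "data/kg")
```

Checking the returned file list against the paths expected by `baselines/gene_set.py` is a quick way to confirm the files landed in the right place.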

## Models

### LLMs

Please see the `summer` directory for more details.

- All prompt templates may be found at `summer/prompts`.
- LLM outputs required to reproduce our paper can be found in the [data
  distribution](https://zenodo.org/records/13760748?token=eyJhbGciOiJIUzUxMiJ9.eyJpZCI6ImQ2NTU1MTZjLTQ1OTktNGFlZi1hNWE2LTk5ZDRhNzIwMGZjZSIsImRhdGEiOnt9LCJyYW5kb20iOiI4ZmQzNDZlNmZhZGQ1MTAzN2YyM2ZlYjU4ZWNjMGZmNCJ9.Ym9Ws841nq4_KDAFxXg4f7FC55jBCCedCEAyh5q44j3D5834pxUIU2mhZwYytQ2NJfb4kSe9re9gTqXJ68F_PA), in the archives named:
  - `summer_outputs.zip`
  - `llm-nocot.zip`
  - `llm-noretrieve.zip`

### Baselines

- Code or instructions required to run the baselines can be found in `baselines`.
- Baselines have their own installation requirements.

## Data attribution

The knowledge graph entries and gene summaries are derived from the following databases:
* **UniProt**  
  UniProt: the Universal Protein Knowledgebase in 2023  
  [Nucleic Acids Res. 51:D523–D531 (2023)](https://academic.oup.com/nar/article/51/D1/D523/6835362?login=true)  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
* **Ensembl 2024**  
  Nucleic Acids Res. 2024, 52(D1):D891–D899. PMID: 37953337  
  [10.1093/nar/gkad1049](https://academic.oup.com/nar/article/52/D1/D891/7416379?login=true)  
  Made available under the terms of the [Apache 2.0 license](https://www.ensembl.org/info/about/legal/code_licence.html).
* **[Gene Ontology data](https://geneontology.org/)**  
  [2024-01-17](http://release.geneontology.org/2024-01-17) release ([DOI:10.5281/zenodo.10536401](https://doi.org/10.5281/zenodo.10536401))  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).
* **CORUM**  
  CORUM: the comprehensive resource of mammalian protein complexes–2022  
  [Nucleic Acids Research, 51(D1):D539–D545](https://academic.oup.com/nar/article/51/D1/D539/6830667)  
  Made available under the terms of the [CC BY-NC 4.0 license](https://creativecommons.org/licenses/by-nc/4.0/deed.en).
* **STRINGDB**  
  [Szklarczyk et al. Nucleic acids research 51.D1 (2023): D638-D646](https://pubmed.ncbi.nlm.nih.gov/36370105/)  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).
* **REACTOME**  
  [The Reactome Pathway Knowledgebase 2024. Nucleic Acids Research. 2024. doi: 10.1093/nar/gkad1025.](https://academic.oup.com/nar/article/52/D1/D672/7369850?login=true&utm_source=advanceaccess&utm_campaign=nar&utm_medium=email)  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/legalcode).
* **Bioplex**  
  [Huttlin et al. (2021) Cell 184(11):3022-3040. doi: 10.1016/j.cell.2021.04.011.](https://doi.org/10.1101/2020.01.19.905109)

The PerturbQA Perturb-seq datasets are derived from the following datasets:
* **Mapping information-rich genotype-phenotype landscapes with genome-scale perturb-seq**  
  Cell, 185(14):2559–2575.e28, 2022. ISSN 0092-8674. doi:505
  https://www.cell.com/cell/pdf/S0092-8674(22)00597-9.pdf  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).
* **Transcriptome-wide characterization of genetic perturbations**  
  bioRxiv, 07 2024. doi: [10.1101/2024.07.03.601903](https://www.biorxiv.org/content/10.1101/2024.07.03.601903v1)  
  Made available under the terms of the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/).