{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Instructions For Training, Testing and Visualizing Results with PSALM"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Requirements"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. The Python environment can be reproduced from `requirements.yml` (the complete list of packages used for training and testing is in `full_requirements.yml`).\n",
    "2. HMMER can be downloaded from [hmmer.org](http://hmmer.org/), which also contains a link to the documentation/user manual.\n",
    "3. ESM-2 can be installed from the [ESM GitHub repository](https://github.com/facebookresearch/esm?tab=readme-ov-file#repostart), which contains the documentation and installation guide.\n",
    "4. The UniProt release from March 2021 can be downloaded from the [UniProt FTP site](https://ftp.uniprot.org/pub/databases/uniprot/previous_major_releases/release-2021_03/knowledgebase/knowledgebase2021_03.tar.gz).\n",
    "5. The Pfam-35.0 HMM database can be downloaded from the [Pfam FTP site](https://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam35.0/Pfam-A.hmm.gz)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training/Validation/Test Sequences"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. The `Train.txt` file contains the IDs of all the sequences used for training the PSALM model\n",
    "2. The `Val.txt` file contains the IDs of all the sequences used as a validation set\n",
    "3. Test sets are organized by max PID as described in the manuscript. The files corresponding to each PID range (0-20, 20-40, 40-60, 60-80, 80-100) are named `Test_20.txt`, `Test_40.txt`, `Test_60.fasta`, `Test_80.txt` and `Test_100.txt` respectively"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training/Validation/Test Ground Truth Annotations"
   ]
  },
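  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Ground truth annotations are determined with the `hmmscan` tool from HMMER. The following command scans each sequence in a FASTA file against the HMM database and writes the results to the location given by `-o`:\n",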
    "\n",
    "`hmmscan --acc -o <output_dir> <hmm_db> <path_to_fasta_file>`"
   ]
  },
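  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When several FASTA files need to be annotated, the command above can be scripted. The following is a minimal sketch and not part of the PSALM code base: the helper names `build_hmmscan_command` and `run_hmmscan_on_directory`, the `*.fasta` glob and the `.out` suffix are assumptions made for illustration. Note that `hmmscan` expects the HMM database to have been indexed with `hmmpress` beforehand.\n",
    "\n",
    "```python\n",
    "import subprocess\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def build_hmmscan_command(hmm_db, fasta_path, output_path):\n",
    "    # Mirrors the documented call: hmmscan --acc -o <output> <hmm_db> <fasta>\n",
    "    return ['hmmscan', '--acc', '-o', str(output_path), str(hmm_db), str(fasta_path)]\n",
    "\n",
    "\n",
    "def run_hmmscan_on_directory(hmm_db, fasta_dir, output_dir):\n",
    "    # Hypothetical layout: every *.fasta file in fasta_dir gets one output file\n",
    "    output_dir = Path(output_dir)\n",
    "    output_dir.mkdir(parents=True, exist_ok=True)\n",
    "    for fasta_path in sorted(Path(fasta_dir).glob('*.fasta')):\n",
    "        output_path = output_dir / (fasta_path.stem + '.out')\n",
    "        command = build_hmmscan_command(hmm_db, fasta_path, output_path)\n",
    "        # check=True raises CalledProcessError if hmmscan exits with an error\n",
    "        subprocess.run(command, check=True)\n",
    "```\n",
    "\n",
    "Keeping the command construction in its own function makes it easy to inspect the exact call before launching a long run."
   ]
  },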
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To speed up training, we save the targets for each sequence in a pickle file by applying the following steps to the HMMER output:\n",
    "1. Use the `parse_hmmscan_results` function from `hmmscan_utils.py` to get a dictionary of results.\n",
    "2. Use the `generate_position_list` function from `hmmscan_utils.py` to get the targets for family and clan membership.\n",
    "3. Save the targets under the keys `fam_vector` and `clan_vector` for each sequence label."
   ]
  },
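  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The last step above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the pickle holds a dictionary keyed by sequence label, that each entry is a dictionary with the keys `fam_vector` and `clan_vector`, and that the two vectors have already been produced by `generate_position_list` (whose signature is not shown in this document). The helper names `build_targets` and `save_targets` are hypothetical.\n",
    "\n",
    "```python\n",
    "import pickle\n",
    "\n",
    "\n",
    "def build_targets(vectors_by_label):\n",
    "    # vectors_by_label maps a sequence label to a (fam_vector, clan_vector) pair\n",
    "    targets = {}\n",
    "    for label, (fam_vector, clan_vector) in vectors_by_label.items():\n",
    "        targets[label] = {'fam_vector': fam_vector, 'clan_vector': clan_vector}\n",
    "    return targets\n",
    "\n",
    "\n",
    "def save_targets(targets, path):\n",
    "    # Write the whole dictionary to a single pickle file\n",
    "    with open(path, 'wb') as handle:\n",
    "        pickle.dump(targets, handle)\n",
    "```\n",
    "\n",
    "With this layout, the targets for one sequence are retrieved as `targets[label]['fam_vector']` and `targets[label]['clan_vector']`."
   ]
  },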
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: Large sequence databases need to be sharded into bins of approximately 10,000 sequences so that the HMMER outputs can be computed and the targets saved with the approach above. The per-shard pickle files can then be combined by simple dictionary merging into the appropriate `Train_targets.pkl`, `Val_targets.pkl` or `Test_<PID>_targets.pkl` file."
   ]
  },
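  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The dictionary merge mentioned in the note could look like the sketch below. It assumes each shard pickle contains a dictionary keyed by sequence label; the function names, the `*.pkl` glob pattern and the duplicate-label check are illustrative choices rather than part of the PSALM scripts.\n",
    "\n",
    "```python\n",
    "import pickle\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def merge_target_shards(shard_paths):\n",
    "    merged = {}\n",
    "    for shard_path in shard_paths:\n",
    "        with open(shard_path, 'rb') as handle:\n",
    "            shard = pickle.load(handle)\n",
    "        # Refuse to silently overwrite a label that appears in two shards\n",
    "        overlap = merged.keys() & shard.keys()\n",
    "        if overlap:\n",
    "            raise ValueError('duplicate sequence labels: ' + ', '.join(sorted(overlap)))\n",
    "        merged.update(shard)\n",
    "    return merged\n",
    "\n",
    "\n",
    "def write_merged_targets(shard_dir, output_path, pattern='*.pkl'):\n",
    "    # Sort the shard files so that the merge order is reproducible\n",
    "    shard_paths = sorted(Path(shard_dir).glob(pattern))\n",
    "    merged = merge_target_shards(shard_paths)\n",
    "    with open(output_path, 'wb') as handle:\n",
    "        pickle.dump(merged, handle)\n",
    "    return len(merged)\n",
    "```\n",
    "\n",
    "Writing the merged file outside the shard directory avoids picking it up as a shard if the merge is run again."
   ]
  },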
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The training script can be run with the following command:\n",
    "\n",
    "`python psalm_train.py -o <name_of_results_directory>`\n",
    "\n",
    "To view all available command-line options, run:\n",
    "\n",
    "`python psalm_train.py -h`\n",
    "\n",
    "Alternatively, read through `parsers.py`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Testing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The testing script can be run with the following command:\n",
    "\n",
    "`python psalm_test.py -o <name_of_results_directory>`\n",
    "\n",
    "To view all available command-line options, run:\n",
    "\n",
    "`python psalm_test.py -h`\n",
    "\n",
    "Alternatively, read through `parsers.py`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation Metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. All metrics data corresponding to the model comparisons in the manuscript are stored in the `metrics` folder.\n",
    "2. To compute and save metrics for your own test predictions, run the following command after the predictions have been created:\n",
    "`python eval.py <name_of_results_directory>`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
