{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MolPAL Tutorial\n",
    "\n",
    "Welcome to the MolPAL tutorial! In this notebook, we'll go over the general workflow involved with performing a run of MolPAL, some tips and tricks for how to set things up, as well as some advice on good optimization parameters and how to analyze the outputs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Table of contents\n",
    "- [Setting up your environment](#setting-up-your-environment)\n",
    "- [How to use MolPAL](#how-to-use-molpal)\n",
    "- [MolPAL workflow](#molpal-workflow)\n",
    "    1. [Utilizing distributed resources](#1-utilizing-distributed-resources)\n",
    "    2. [Preparing the virtual library](#2-preparing-the-virtual-library)\n",
    "    3. [Library featurization](#3-library-featurization)\n",
    "    4. [Defining the run configuration](#4-defining-the-run-configuration)\n",
    "    5. [Running MolPAL (again)](#5-running-molpal)\n",
    "    6. [Analyzing the results](#6-analyzing-the-results)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setting up your environment\n",
    "\n",
    "Setting up a virtual environment to run MolPAL using `conda` is fairly simple using the supplied [`environment.yml`](./environment.yml) file. *Note*: you may need to alter the `pytorch` lines depending on whether you will be working a CUDA-enabled system.\n",
    "```bash\n",
    "$ conda env create -f environment.yml\n",
    "$ conda activate molpal\n",
    "$ pip install . --upgrade\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to use MolPAL\n",
    "\n",
    "The sections below describe some high-level and low-level steps to follow prior to actually running MolPAL. Because these sections will sometimes refer to some command line arguments used in MolPAL, we will first introduce the general run command first.\n",
    "\n",
    "After your environment has been set up and the `molpal` library is installed, you can run molpal from the command line like so:\n",
    "```bash\n",
    "$ molpal run [...]\n",
    "```\n",
    "There are a lot of options that you can specify when running MolPAL, some more imortant than others, and you can see them all by running the cell below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! molpal run --help"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make argument specification easier when running MolPAL, you can supply all arguments via a config file using YAML syntax. For more details on the config file, see [here](https://goo.gl/R74nmi)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## MolPAL workflow\n",
    "\n",
    "In the following sections, we'll go over the general steps for using MolPAL in a prospective worklow."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Utilizing distributed resources\n",
    "\n",
    "MolPAL is able to leverage distributed hardware allocations to parallelize each step in the optimization loop: (i) model training, (ii) prediction, and (iii) docking. To do so, it uses a [`ray`](ray.io) cluster on the backend. Before running MolPAL (step 5) or prefeaturizing the library (step 3), you must start a ray cluster. The setup details vary depending on your resource allocation, but there are two broad cases to consider:\n",
    "\n",
    "### (A) local parallelization\n",
    "\n",
    "This case is fairly simple. To start a local ray cluster, run the following command:\n",
    "```bash\n",
    "$ ray start --head\n",
    "```\n",
    "from the command line. This will start a ray cluster on your machine with all available resources on your machine. If you would like to restrict MolPAL to only using some fraction of the resources on your machine, start the ray cluster like so:\n",
    "```bash\n",
    "$ ray start --head --num-cpus NUM_CPUS --num-gpus NUM_GPUS\n",
    "```\n",
    "Where `NUM_CPUS` and `NUM_GPUS` are the absolute number of CPUs and GPUs you'd like MolPAL to have access to during its run.\n",
    "\n",
    "### (B) distributed parallelization on an HPC\n",
    "The `ray` libary allows MolPAL to be written such that the specific hardware setup makes no difference to the code/parallelization architecture of the source code. The only difference this causes is in the definition of the ray cluster. On an HPC, we may easily get access to multiple compute nodes over which to run a MolPAL job. Utilizing each of these nodes is very similar to the local cluster setup from above. We first start a ray cluster on the \"head\" node and then start a ray cluster on each worker node and connect these resources to the \"head\" node. The specific syntax of this proces will vary on the scheduling system of your HPC, but the concepts remain the same. Below is an excerpt from a SLURM submission script that connects all the nodes in a job to a single ray cluster. You can see this in the context of the full submission script [here](./run_molpal.batch)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```bash\n",
    "# SLURM doesn't have a convenient SLURM_NUM_GPUS_ON_NODE environment variable, so we have to hack one together\n",
    "export NUM_GPUS=$( echo $CUDA_VISIBLE_DEVICES | awk -F ',' '{print NF}' )\n",
    "\n",
    "# optional, make a private redis password to protect your ray cluster from opportunists on your cluster\n",
    "export redis_password=$( uuidgen 2> /dev/null )\n",
    "\n",
    "# get the node names in this job\n",
    "nodes=$( scontrol show hostnames $SLURM_JOB_NODELIST )\n",
    "nodes_array=( $nodes )\n",
    "\n",
    "# get the IP address of the head node\n",
    "node_0=${nodes_array[0]} \n",
    "ip=$( srun -N 1 -n 1 -w $node_0 hostname --ip-address )\n",
    "\n",
    "# get an open port\n",
    "# NOTE: this will very rarely fail in the case that the port is assigned between the time python\n",
    "# tells you the port is open and the time you actually go to use it. If this happens, just resubmit\n",
    "port=$( python -c 'import socket; s=socket.socket(); s.bind((\"\", 0)); print(s.getsockname()[1]); s.close()' )\n",
    "\n",
    "# set the address of the head node for worker nodes\n",
    "export ip_head=$ip:$port\n",
    "echo \"IP Head: $ip_head\"\n",
    "\n",
    "# start the \"head\" ray cluster\n",
    "# NOTE: by default, ray will assume it has access to all *visible* CPUs and GPUs on a given node.\n",
    "# This is not actually true, so we restrict it to using only as many CPUs and GPUs as were assigned\n",
    "# per node in this job\n",
    "srun -N 1 -n 1 -w $node_0 \\\n",
    "    ray start --head --node-ip-address=$ip --port=$port --redis-password=$redis_password \\\n",
    "    --num-cpus $SLURM_CPUS_ON_NODE --num-gpus $NUM_GPUS --block > /dev/null 2>& 1 &\n",
    "# it doesn't hurt to sleep just to ensure the head cluster is up and running before going forward\n",
    "sleep 10\n",
    "\n",
    "# start the \"worker\" ray clusters on every _other_ node in the allocation\n",
    "worker_num=$(( $SLURM_JOB_NUM_NODES - 1 ))\n",
    "for ((  i=1; i<=$worker_num; i++ )); do\n",
    "    node_i=${nodes_array[$i]}\n",
    "    echo \"STARTING WORKER $i at $node_i\"\n",
    "    srun -N 1 -n 1 -w $node_i \\\n",
    "        ray start --address $ip_head --redis-password=$redis_password \\\n",
    "        --num-cpus $SLURM_CPUS_ON_NODE --num-gpus $NUM_GPUS --block > /dev/null 2>& 1 &\n",
    "    sleep 1\n",
    "done\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The ray cluster paradigm also makes it possible to use cloud-based compute resources like AWS or GCP. For more on that, check out this [this page](https://docs.ray.io/en/master/cluster/vms/user-guides/launching-clusters/index.html) in the ray documentation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Preparing the virtual library\n",
    "A virtual library in the context of MolPAL is a simple CSV file containing the SMILES string of each compound in your library, like so:\n",
    "```\n",
    "$ cat library.csv\n",
    "smiles\n",
    "C1=C(C2=C(C=C1O)OC(C(C2=O)=O)C3=CC=C(C(=C3)O)O)O\n",
    "O=S(=O)(N1CCNCCC1)C2=CC=CC=3C2=CC=NC3\n",
    "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
    "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
    "C=1(N=C(C=2C=NC=CC2)C=CN1)NC=3C=C(NC(C4=CC=C(CN5CCN(CC5)C)C=C4)=O)C=CC3C\n",
    "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
    "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
    "N1(C2=C(C(N)=NC=N2)C=N1)C3=CC=CC=C3\n",
    "CC(=O)CC(=O)C\n",
    "```\n",
    "\n",
    "As you might have noticed, `library.txt` contains some duplicate compounds. This can result in inefficiencies during the batch selection step, so we should remove these before running MolPAL:\n",
    "```\n",
    "$ uniq library.csv > library.dedup.csv\n",
    "$ cat library.dedup.csv\n",
    "smiles\n",
    "C1=C(C2=C(C=C1O)OC(C(C2=O)=O)C3=CC=C(C(=C3)O)O)O\n",
    "O=S(=O)(N1CCNCCC1)C2=CC=CC=3C2=CC=NC3\n",
    "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
    "C=1(N=C(C=2C=NC=CC2)C=CN1)NC=3C=C(NC(C4=CC=C(CN5CCN(CC5)C)C=C4)=O)C=CC3C\n",
    "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
    "N1(C2=C(C(N)=NC=N2)C=N1)C3=CC=CC=C3\n",
    "CC(=O)CC(=O)C\n",
    "```\n",
    "\n",
    "This library may now be supplied to MolPAL from the command line like so:\n",
    "```\n",
    "$ molpal run --library library.dedup.csv\n",
    "```\n",
    "\n",
    "In some cases, you may be searching within a library that has been split into several files (\"shards\") to limit the size of any one file. Using these sharded libraries is no different, just supply them all on the command line:\n",
    "```\n",
    "$ molpal run --library library_0.csv library_1.csv ... library_N.csv [OTHER_ARGS]\n",
    "```\n",
    "**NOTE**: all libraries must have the same format! That is, the SMILES strings must all be the same column of the CSV and they must all have (or not have) a title line\n",
    "\n",
    "Lastly, if your library contains CXSMILES strings, just add the `--cxsmiles` flag on the command line."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another thing to consider in the impact of tautomers on docking results. It's possible that some of these compounds (maybe not these ones specifically) exist in an ensemble of tautomeric states, some of which bind preferentially to our protein of interest. Unfortunately, given that SMILES strings correspond to only one tautomer, we might miss them. We will need to enumerate all possible tautomers we would like to include, but we should be selective in which tautomers we enumerate first to avoid blowing up the size of our library. As a hypothetical example, let's consider that we have a number of 1,3-diketones in our library. These compounds can undergo a keto-enol tautomerization to form an intramolecular H-bonding interaction, like so:\n",
    "\n",
    "![proton_shift](./assets/proton-shift.png)\n",
    "\n",
    "This would rigidify our molecule's conformational flexibility and potentially lead to more favorable binding compared to the native 1,3-diketone. The cell below (1) searches the libary for 1,3-diketones, (2) enumerates tatuomers of these compounds, (3) keeps only the 1,3-β-hydroxy-ketones, (4) appends these enumerated tautomers to the end of the library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "creating the library...\n",
      "here it is!\n",
      "smiles\n",
      "C1=C(C2=C(C=C1O)OC(C(C2=O)=O)C3=CC=C(C(=C3)O)O)O\n",
      "O=S(=O)(N1CCNCCC1)C2=CC=CC=3C2=CC=NC3\n",
      "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
      "C=1(N=C(C=2C=NC=CC2)C=CN1)NC=3C=C(NC(C4=CC=C(CN5CCN(CC5)C)C=C4)=O)C=CC3C\n",
      "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
      "N1(C2=C(C(N)=NC=N2)C=N1)C3=CC=CC=C3\n",
      "CC(=O)CC(=O)C\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo creating the library...\n",
    "echo \"smiles\n",
    "C1=C(C2=C(C=C1O)OC(C(C2=O)=O)C3=CC=C(C(=C3)O)O)O\n",
    "O=S(=O)(N1CCNCCC1)C2=CC=CC=3C2=CC=NC3\n",
    "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
    "C=1(N=C(C=2C=NC=CC2)C=CN1)NC=3C=C(NC(C4=CC=C(CN5CCN(CC5)C)C=C4)=O)C=CC3C\n",
    "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
    "N1(C2=C(C(N)=NC=N2)C=N1)C3=CC=CC=C3\n",
    "CC(=O)CC(=O)C\" > library.csv\n",
    "echo \"here it is!\"\n",
    "cat library.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "from rdkit import Chem\n",
    "from rdkit.Chem.MolStandardize import rdMolStandardize\n",
    "\n",
    "enumerator = rdMolStandardize.TautomerEnumerator()\n",
    "diketone_13 = Chem.MolFromSmarts(\"O=C-C-C=O\")\n",
    "beta_hydroxy_ketone = Chem.MolFromSmarts(\"[OH]-C=C-C=O\")\n",
    "\n",
    "tauts = []\n",
    "with open(\"library.csv\", \"r\") as fid:\n",
    "    reader = csv.reader(fid)\n",
    "    next(reader)    # consume the title line\n",
    "\n",
    "    for row in reader:\n",
    "        smi = row[0]\n",
    "        mol = Chem.MolFromSmiles(smi)\n",
    "        if not mol.HasSubstructMatch(diketone_13):\n",
    "            continue\n",
    "        \n",
    "        tauts.extend(\n",
    "            [t for t in enumerator.Enumerate(mol) if t.HasSubstructMatch(beta_hydroxy_ketone)]\n",
    "        )\n",
    "\n",
    "extra_smis = [Chem.MolToSmiles(t) for t in tauts]\n",
    "with open(\"library.csv\", \"a\") as fid:\n",
    "    writer = csv.writer(fid)\n",
    "    writer.writerows([[smi] for smi in extra_smis])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "here's the library with the new tautomers!\n",
      "smiles\n",
      "C1=C(C2=C(C=C1O)OC(C(C2=O)=O)C3=CC=C(C(=C3)O)O)O\n",
      "O=S(=O)(N1CCNCCC1)C2=CC=CC=3C2=CC=NC3\n",
      "C=1C=C2S/C(/N(CC)C2=CC1OC)=C\\C(=O)\n",
      "C=1(N=C(C=2C=NC=CC2)C=CN1)NC=3C=C(NC(C4=CC=C(CN5CCN(CC5)C)C=C4)=O)C=CC3C\n",
      "C1=CC=2C(=CNC2C=C1)C=3C=CN=CC3\n",
      "N1(C2=C(C(N)=NC=N2)C=N1)C3=CC=CC=C3\n",
      "CC(=O)CC(=O)C\n",
      "CC(=O)C=C(C)O\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "echo \"here's the library with the new tautomers!\"\n",
    "cat library.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Library featurization\n",
    "\n",
    "Both RF and NN models expect vector-based inputs (e.g., molecular fingerprints) for training and prediction. During startup, MolPAL will calculate the molecular fingerprint of each molecule in the library and store it in an HDF5 file to be reused throughout the optimization. However, if you plan to run multiple experiments with the same library, this featurization will be performed for each individual experiment. To avoid this, we can prefeaturize a virtual library using the [`fingerprints.py`](./scripts/fingerprints.py) script. This script utilizes a ray cluster (see [above](#1-utilizing-distributed-resources) on how to set one up) to parallelize this calculation. The output of the script is the same format of HDF5 file that molpal will use to store the molecular fingerprints. Additionally, it validates the SMILES strings in your library and outputs a list of invalid lines. Both of these may be supplied to a molpal run like so:\n",
    "```\n",
    "$ molpal run --fps OUTPUT_FILE --invalid-lines INVALID_LINES\n",
    "```\n",
    "The `fingerprints.py` may be invoked like so:\n",
    "```\n",
    "$ python scripts/fingerprints.py --help\n",
    "usage: fingerprints.py [-h] [-o OUTPUT]\n",
    "                       [--fingerprint {morgan,pair,rdkit,maccs}]\n",
    "                       [--radius RADIUS] [--length LENGTH] -l LIBRARY\n",
    "                       [LIBRARY ...] [--no-title-line]\n",
    "                       [--total-size TOTAL_SIZE] [-d DELIMITER]\n",
    "                       [--smiles-col SMILES_COL]\n",
    "\n",
    "optional arguments:\n",
    "  -h, --help            show this help message and exit\n",
    "  -o OUTPUT, --output OUTPUT\n",
    "                        the filepath under of the output fingerprints HDF5\n",
    "                        file. Will add '.h5' to the suffix. If no\n",
    "                        name is provided, output file will be named\n",
    "                        <library>.h5\n",
    "  --fingerprint {morgan,pair,rdkit,maccs}\n",
    "                        the type of encoder to use\n",
    "  --radius RADIUS       the radius or path length to use for fingerprints\n",
    "  --length LENGTH       the length of the fingerprint\n",
    "  -l LIBRARY [LIBRARY ...], --library LIBRARY [LIBRARY ...]\n",
    "                        the files containing members of the MoleculePool\n",
    "  --no-title-line       whether there is no title line in the library file\n",
    "  --total-size TOTAL_SIZE\n",
    "                        (if known) the total number of molecules in the\n",
    "                        library file\n",
    "  -d DELIMITER, --delimiter DELIMITER\n",
    "                        the column separator in the library file\n",
    "  --smiles-col SMILES_COL\n",
    "                        the column containing the SMILES string in the library\n",
    "                        file\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Defining the run configuration\n",
    "\n",
    "MolPAL is controlled by two broad parameter sets: (A) the optimization hyperparameters and (B) the docking parameters.\n",
    "\n",
    "**A. Optimization hyperparameters**\n",
    "\n",
    "It's impossible to know what optimization setup will be ideal for a given task. In the initial study, we looked at broad sets of parameters across a range of tasks to identify a default set of hyperparameters appropriate for most optimizations:\n",
    "- a directed message passing neural network surrogate model: `--model mpn`\n",
    "- mean-variance estimation uncertainty quantification: `--conf-method mve`\n",
    "- UCB acquisition metric: `--metric ucb`\n",
    "- between 6-10 batches of exploration: decide on a total sample budget, and divide this by anywhere from 6 to 10, this number will be both your `--init-size` and `--batch-size` and the divisor you chose will be the `--max-iters`.\n",
    "\n",
    "We could get more in the weeds here about how we setup hyperparameters: specifying the learnings rates of the MPNN model, using a different `beta` value for UCB, scheduling the batch size, etc. However, most of these parameters will be problem specific, so it's impossible to recommend anything other the default values we've tested so far. \n",
    "\n",
    "**B. Docking parameters**\n",
    "\n",
    "These settings are entirely dependent on a user's given target, and we refer a reader to [this guide][bender_guide_2021] from Bender et al. on how to set up a docking protol for virtual screening.\n",
    "\n",
    "To use docking with molpal, we supply two things to the molpal argument vector: `--objective docking` and `--objective-config PYSCREENER_CONFIG`, where `PYSCREENER_CONFIG` is the filepath of a configuration file of a `pyscreener` run and typically looks like so:\n",
    "```\n",
    "$ cat pyscreener_config.ini\n",
    "screen-type = vina\n",
    "metadata-template = {\"software\": \"vina\"}\n",
    "receptors = [path/to/receptor.pdb]\n",
    "\n",
    "center = [CENTER_X, CENTER_Y, CENTER_Z]\n",
    "size = [SIZE_X, SIZE_Y, SIZE_Z]\n",
    "\n",
    "ncpu = 4\n",
    "```\n",
    "The core of this file lies in the (a) the value of `receptors` which are the filepath(s) of a receptors in PDB format into which your compounds will be docked and (b) the `center` and `size` parameters, which define your the x-, y-, and z-centers and radii of your docking box, respectively. You may also use other vina-type docking software, such as Smina, PSOVina, or QVina2 by specifying them in the `\"software\"` value. It's also possible to DOCK6. To learn more about this file, see the [pyscreener repo](https://github.com/coleygroup/pyscreener).\n",
    "\n",
    "[bender_guide_2021]: https://www.nature.com/articles/s41596-021-00597-z \"Bender, B.J., Gahbauer, S., Luttens, A. et al. A practical guide to large-scale docking. Nat Protoc 16, 4799–4832 (2021). https://doi.org/10.1038/s41596-021-00597-z\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Putting all of these together, a typical MolPAL config file will look like so:\n",
    "```\n",
    "output_dir: my_molpal_run\n",
    "--write-intermediate\n",
    "--write-final\n",
    "--retrain-from-scratch\n",
    "\n",
    "pool: lazy\n",
    "library: path/to/library.csv\n",
    "\n",
    "model: mpn\n",
    "conf-method: mve\n",
    "\n",
    "metric: ucb\n",
    "beta: 2\n",
    "init-size: 0.002\n",
    "batch-size: 0.002\n",
    "\n",
    "objective: docking\n",
    "objective-config: path/to/pyscreener_config.ini\n",
    "--minimize\n",
    "\n",
    "top-k: 0.01\n",
    "window-size: 3\n",
    "max-iters: 9\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Running MolPAL\n",
    "Assuming the sample config file from the previous cell is saved to `config.yaml` and we've completed all the steps above, we're ready to go!\n",
    "\n",
    "```bash\n",
    "$ molpal run --config config.yaml\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Analyzing the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Upon completion of the MolPAL run, the last thing to do is use the results. The output of a MolPAL run will look like so:\n",
    "```\n",
    "$ tree my_molpal_run -F\n",
    "my_molpal_run/\n",
    "├── all_explored_final.csv  <--- CSV containing SMILES and negative docking score for each compound, ordered by docking score (higher is better!)\n",
    "├── chkpts/ <--- the checkpoint for each iteration\n",
    "├── config.ini <--- the config used to rerun this experiment\n",
    "├── data/ <--- the data of each iteration in exploration order\n",
    "├── *.tar.gz    <--- zipped tarballs containing docking inputs/outputs\n",
    "└── extended.csv <--- CSV containing SMILES, docking score, name, and node used to run the compound's docking calculation \n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Typically, docking score alone won't be sufficient to select molecules for further simulation or experimentation, we'll need to extract the poses of the compounds from our run by running molpal in `extract` mode like so:\n",
    "```\n",
    "$ molpal extract --help\n",
    "usage: molpal extract [-h] [-n NAME] parent_dir k\n",
    "\n",
    "positional arguments:\n",
    "  parent_dir            the root directory of a molpal run\n",
    "  k                     the number of top-scoring compounds to extract. If\n",
    "                        there are not at least `k` compounds with scores, then\n",
    "                        the `<k` scored compounds will be extracted.\n",
    "\n",
    "optional arguments:\n",
    "  -h, --help            show this help message and exit\n",
    "  -n NAME, --name NAME  the name of the directory under which all output will\n",
    "                        be placed\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Running `extract` on our MolPAL run with a value of `100` creates a `poses` subdirectory which contains the docking poses of the top-100 compounds in the native output format of our chosen docking software (e.g., PDBQT for Vina):\n",
    "```\n",
    "$ molpal extract my_molpal_run 100\n",
    "$ tree my_molpal_run -F\n",
    "my_molpal_run/\n",
    "├── all_explored_final.csv \n",
    "├── chkpts/\n",
    "├── config.ini\n",
    "├── data/\n",
    "├── poses/\n",
    "├── *.tar.gz\n",
    "└── extended.csv\n",
    "```\n",
    "Next using the `extended.csv` file in our MolPAL output directory, we can identify which compounds are interesting to us and the name associated with their pose file. The poses may be inspected visually using tools like [Chimera](https://www.cgl.ucsf.edu/chimera/) or [PyMol](https://pymol.org/2/). For large amounts of compounds, it can be useful to help automate this process by (1) filtering out compounds which **do not** make key interactions and (2) reranking the compounds based on the score of the pose that *does* make those key interactions (assuming you have a binding hypothesis) using tools like [ProLIF](https://prolif.readthedocs.io/en/latest/index.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.13 ('molpal')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.13"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "1600cd4d85c9e3d5f9de1a7dc48bbe2286f5e3a3d48b3fa0d95af6f1f93a23b3"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
