{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ReoBzap1acVv"
   },
   "source": [
    "# Data loading and preprocessing\n",
    "\n",
    "In this tutorial, we demonstrate how to process a 31,000 cell 27-day time course of embryoid body (EB) differentiation using scRNA-seq.\n",
    "\n",
    "We review the following steps:\n",
    "\n",
    "1. Loading 10X data  \n",
    "2. Filtering\n",
    "3. Normalization\n",
    "4. Transformation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Introducing `scprep`\n",
    "\n",
    "`scprep` is a lightweight scRNA-seq toolkit for `python` Data Scientists.\n",
    "\n",
    "Most scRNA-seq toolkits are written in `R` (the most famous being [Seurat](https://satijalab.org/seurat/)), but we (and a majority of machine learning / data scientists) develop our tools in `python`. Currently, [Scanpy](https://icb-scanpy.readthedocs-hosted.com/en/stable/) is the most popular toolkit for scRNA-seq analysis in `python`. However, Scanpy has a highly structured framework for data representation that is incompatible with the bulk of the `python` data science framework, e.g. [pandas](https://pandas.pydata.org/), [SciPy](https://www.scipy.org/), and [scikit-learn](https://scikit-learn.org/stable/).\n",
    "\n",
    "To accommodate users of the wider `python` data analysis ecosystem, we developed `scprep` (<b>s</b>ingle <b>c</b>ell <b>prep</b>aration). `scprep` makes it easier to use the pandas / SciPy / scikit-learn ecosystem for scRNA-seq analysis. Most of `scprep` is composed of helper functions to perform tasks common to single cell data like loading counts matrices, filtering & normalizing cells by library size, and calculating common statistics. The key advantage of `scprep` is that data can be stored in Pandas DataFrames, NumPy arrays, Scipy sparse matrices, and no matter which tools you choose to interface with, _it just works_.\n",
    "\n",
    "To learn more about `scprep`, you can read the documentation at https://scprep.readthedocs.io/."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oBJb_vVW1ONw"
   },
   "source": [
    "## 1. Install tools\n",
    "\n",
    "To run this notebook, you need to download a couple of packages. Unfortunately, due to restrictions of Google CoLab, these installations won't be saved. A workaround can be found at the following link if you're interested: https://stackoverflow.com/questions/52582858/saving-pip-installs-in-google-colab\n",
    "\n",
    "For now, simply run the following line of code by clicking on the code cell and hit `Shift` + `Enter`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 373
    },
    "colab_type": "code",
    "id": "RUZx-nc_1Mdr",
    "outputId": "87ca2da2-11c4-4d45-a18b-d861b965c159"
   },
   "outputs": [],
   "source": [
    "!pip install scprep tasklogger"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "lLeE0wGiacVx"
   },
   "source": [
    "### Time course of human embryoid body differentation\n",
    "\n",
    "Low passage H1 hESCs were maintained on Matrigel-coated dishes in DMEM/F12-N2B27 media supplemented with FGF2. For EB (Embriod Body) formation, cells were treated with Dispase, dissociated into small clumps and plated in non-adherent plates in media supplemented with 20% FBS, which was prescreened for EB differentiation. Samples were collected during 3-day intervals during a 27 day-long differentiation timecourse. An undifferentiated hESC sample was also included (Figure S7D). Induction of key germ layer markers in these EB cultures was validated by qPCR (data not shown). For single cell analyses, EB cultures were dissociated, FACS sorted to remove doublets and dead cells and processed on a 10x genomics instrument to generate cDNA libraries, which were then sequenced. Small scale sequencing determined that we have successfully collected data on approximately 31,000 cells equally distributed throughout the timecourse."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "XUF17reyacVy"
   },
   "source": [
    "<a id='loading'></a>\n",
    "## 2. Loading 10X data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "oNu79_4iacVz"
   },
   "source": [
    "### Downloading Data from Mendeley Datasets\n",
    "\n",
    "The EB dataset is publically available as `scRNAseq.zip` at Mendelay Datasets at <https://data.mendeley.com/datasets/v6n743h5ng/>. \n",
    "\n",
    "Inside the scRNAseq folder, there are five subdirectories, and in each subdirectory are three files: `barcodes.tsv`, `genes.tsv`, and `matrix.mtx`. For more information about how CellRanger produces these files, check out the [Gene-Barcode Matrices Documentation](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/matrices).\n",
    "\n",
    "Here's the directory structure:\n",
    "```\n",
    "download_path\n",
    "└── scRNAseq\n",
    "    ├── scRNAseq.zip\n",
    "    ├── T0_1A\n",
    "    │   ├── barcodes.tsv\n",
    "    │   ├── genes.tsv\n",
    "    │   └── matrix.mtx\n",
    "    ├── T2_3B\n",
    "    │   ├── barcodes.tsv\n",
    "    │   ├── genes.tsv\n",
    "    │   └── matrix.mtx\n",
    "    ├── T4_5C\n",
    "    │   ├── barcodes.tsv\n",
    "    │   ├── genes.tsv\n",
    "    │   └── matrix.mtx\n",
    "    ├── T6_7D\n",
    "    │   ├── barcodes.tsv\n",
    "    │   ├── genes.tsv\n",
    "    │   └── matrix.mtx\n",
    "    └── T8_9E\n",
    "        ├── barcodes.tsv\n",
    "        ├── genes.tsv\n",
    "        └── matrix.mtx\n",
    "```\n",
    "\n",
    "If you have downloaded the files already, set the `download_path` below to the directory where you saved the files. If not, the following code will download the data for you. Not that the download is 746MB: you must have sufficient disk space for the download.\n",
    "\n",
    "**Please Note:** If you are trying to analyze data generated by a more recent version of `cellranger` (a program provided by 10X Genomics to go from raw fastq files to count matrices), the three files generated per sample are `barcodes.tsv.gz`, `features.tsv.gz`, and `matrix.mtx.gz`. The procedure for loading these files with `scprep` is the same as with the previous version."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "AaPgko5lacWp"
   },
   "source": [
    "### Using `scprep` to import data into Pandas DataFrames\n",
    "\n",
    "\n",
    "We use a toolkit for loading and manipulating single-cell data called `scprep`. The function `load_10X` will automatically load 10X scRNAseq datasets (and others) into a Pandas DataFrame. DataFrames are incredibly useful tools for data analysis in Python. To learn more about them, [check out the Pandas Documentation and Tutorials](https://pandas.pydata.org/pandas-docs/stable/).\n",
    "\n",
    "\n",
    "Let's load the data and create a single matrix that we can use for preprocessing, visualization, and analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "8UwGtjtSacWx"
   },
   "source": [
    "### 2.1. Standard imports\n",
    "\n",
    "We'll import these few packages at the beginning of nearly every session."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "xHNATT38acWz"
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import scprep\n",
    "\n",
    "# matplotlib settings for Jupyter notebooks only\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DVlHIGbHacW8"
   },
   "source": [
    "### 2.2. Use `scprep.io.load_10X` to import all three matrices into a DataFrame\n",
    "\n",
    "#### Understanding the difference between sparse and dense data\n",
    "\n",
    "By default, `scprep.io.load_10X` loads scRNA-seq data using the Pandas DataFrame with sparse columns [(**see Pandas docs**)](https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html) to maximize memory efficiency. However, this will be slower than loading on a dense matrix. Let's have a look at how long it takes to load the data from the first time point and how much memory it uses with dense and sparse data. We'll use `tasklogger` to measure how long each loading step takes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "\n",
    "import tasklogger\n",
    "\n",
    "download_path = \"./a/anonymous./time_series/\"\n",
    "# load data in dense format\n",
    "with tasklogger.log_task(\"dense\"):\n",
    "    data_time1 = scprep.io.load_10X(\n",
    "        os.path.join(download_path, \"scRNAseq\", \"T0_1A\"), sparse=False, gene_labels=\"both\"\n",
    "    )\n",
    "\n",
    "# measure the size of the matrix with pickle\n",
    "print(\"Size: {:.1f}MB\".format(len(pickle.dumps(data_time1)) / 1024**2))\n",
    "\n",
    "# load data in sparse format\n",
    "with tasklogger.log_task(\"sparse\"):\n",
    "    data_time1 = scprep.io.load_10X(\n",
    "        os.path.join(download_path, \"scRNAseq\", \"T0_1A\"), sparse=True, gene_labels=\"both\"\n",
    "    )\n",
    "\n",
    "# measure the size of the matrix with pickle\n",
    "print(\"Size: {:.1f}MB\".format(len(pickle.dumps(data_time1)) / 1024**2))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So for a small penalty in computation time, we can save a lot of memory. If memory is limited (for example, on Google Colaboratory) it's recommended that you use the sparse format.\n",
    "\n",
    "You'll also notice that we've received a warning (above in red) that the gene symbols are not unique. This doesn't cause a problem with the dense data, but it does with the sparse. We can specify `gene_labels='both'` to ask for our column names as `SYMB (ENSID)` which gives us both the human readability of symbols but also the uniqueness of Ensembl IDs."
   ]
  },
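  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The memory saving comes from storing only the nonzero entries. As a rough illustration that is independent of the EB data, the sketch below builds a toy counts matrix and compares the dense and sparse pandas representations. The matrix size and its roughly 95% sparsity are assumptions chosen for the example, and the names `toy_dense` and `toy_sparse` are hypothetical.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Toy counts matrix: 1,000 cells x 200 genes, mostly zeros (assumed sparsity)\n",
    "rng = np.random.default_rng(0)\n",
    "counts = rng.poisson(0.05, size=(1000, 200))\n",
    "\n",
    "toy_dense = pd.DataFrame(counts)\n",
    "# Convert every column to a pandas sparse dtype with fill value 0\n",
    "toy_sparse = toy_dense.astype(pd.SparseDtype('int64', 0))\n",
    "\n",
    "dense_mb = toy_dense.memory_usage(deep=True).sum() / 1024**2\n",
    "sparse_mb = toy_sparse.memory_usage(deep=True).sum() / 1024**2\n",
    "print('dense: {:.2f}MB, sparse: {:.2f}MB'.format(dense_mb, sparse_mb))\n",
    "```\n",
    "\n",
    "The sparser the matrix, the larger the gap, which is why the sparse format pays off for scRNA-seq counts."
   ]
  },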
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Understanding the data matrix\n",
    "\n",
    "How big is our matrix? We can find out with `data.shape`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_time1.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have 4,700 cells (rows) and 33,600 genes (columns). Each row and column has a name associated with it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The _columns_ refer to the genes, of features of the data. `data.columns` gives these as a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_time1.columns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The _rows_ refer to the cells, of observations of the data. `data.index` gives these as a list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_time1.index"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that some tools expect the opposite: that genes would be on the rows and cells on the columns. This is a hold-over from bulk RNA-seq, when it made sense for each replicate (of which there were few) to be a feature and each gene (of which there were many) to be an observation; however, now that we can have as many or more cells than genes, it makes more sense for cells to be observations. Keep this difference in mind when you try new tools."
   ]
  },
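  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If a tool hands you a genes-by-cells table, transposing it gives the cells-by-genes layout used in this notebook. A minimal sketch with a made-up table (the gene and cell names are hypothetical):\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical genes-by-cells table, as some tools export it\n",
    "genes_by_cells = pd.DataFrame(\n",
    "    [[0, 3, 1], [5, 0, 2]],\n",
    "    index=['GeneA', 'GeneB'],\n",
    "    columns=['cell_1', 'cell_2', 'cell_3'],\n",
    ")\n",
    "\n",
    "# Transpose so that cells are rows (observations) and genes are columns (features)\n",
    "cells_by_genes = genes_by_cells.T\n",
    "print(cells_by_genes.shape)  # (3, 2)\n",
    "```"
   ]
  },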
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, we can view the actual matrix, either by simply typing the matrix name to view the whole thing, or using the `.head()` function to view the first few rows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_time1.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Loading the whole dataset\n",
    "\n",
    "Now we know how to load the data, let's load the data matrix for each sample (this may take a few minutes). Note that [`scprep.io`](https://scprep.readthedocs.io/en/stable/reference.html#module-scprep.io) has a range of different input functions: csv, tsv, mtx, and fcs are also available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 358
    },
    "colab_type": "code",
    "id": "9lc31so9acW-",
    "outputId": "51524854-7c99-47f2-c79d-d8fe55c46e6c",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%%time\n",
    "sparse = True\n",
    "data_time1 = scprep.io.load_10X(\n",
    "    os.path.join(download_path, \"scRNAseq\", \"T0_1A\"), sparse=sparse, gene_labels=\"both\"\n",
    ")\n",
    "data_time2 = scprep.io.load_10X(\n",
    "    os.path.join(download_path, \"scRNAseq\", \"T2_3B\"), sparse=sparse, gene_labels=\"both\"\n",
    ")\n",
    "data_time3 = scprep.io.load_10X(\n",
    "    os.path.join(download_path, \"scRNAseq\", \"T4_5C\"), sparse=sparse, gene_labels=\"both\"\n",
    ")\n",
    "data_time4 = scprep.io.load_10X(\n",
    "    os.path.join(download_path, \"scRNAseq\", \"T6_7D\"), sparse=sparse, gene_labels=\"both\"\n",
    ")\n",
    "data_time5 = scprep.io.load_10X(\n",
    "    os.path.join(download_path, \"scRNAseq\", \"T8_9E\"), sparse=sparse, gene_labels=\"both\"\n",
    ")\n",
    "data_time5.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "77XS07XVacXH"
   },
   "source": [
    "### 2.3. Library size filtering\n",
    "\n",
    "#### Why we filter cells by library size\n",
    "In scRNA-seq the library size of a cell is the number of unique mRNA molecules detected in that cell. These unique molecules are identified using a random barcode incorporated during the first round of reverse transcription. This barcode is called a <b>U</b>nique <b>M</b>olecule <b>I</b>dentifier, and often we refer to the number unique mRNAs in a cell as the number of UMIs. To read more about UMIs, [Smith *et al.* (2017)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5340976/) write about how sequencing errors and PCR amplification errors lead to innaccurate quantification of UMIs/cell.\n",
    "\n",
    "Depending on the method of scRNA-seq, the amount of library size filtering done can vary. The 10X Genomics CellRanger tool, the DropSeq and InDrops pipelines, and the Umitools package each have their own method and cutoff for determining real cells from empty droplets. Additional methods exist for trying to detect the difference between droplets containing one cell and droplets containing two cells (\"doublets\"). You can take these methods at face value or set some manual cutoffs based on your data.\n",
    "\n",
    "#### Visualing the library size distribution using `scprep`\n",
    "\n",
    "There is a helper function for plotting library size from a gene expression matrix in scprep called [`scprep.plot.plot_library_size()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.plot_library_size).\n",
    "\n",
    "Let's have a look at the library size for the first time point. In this sample we see that there is a small number of cells with very small library sizes and a long tail of cells that have very high library sizes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 316
    },
    "colab_type": "code",
    "id": "lg6jXB-IacXI",
    "outputId": "9f52f71f-9ee8-45f1-c5a4-97fb886466f1",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "scprep.plot.plot_library_size(data_time1, log=False, title=\"Library Size Before Filtering\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Choose cutoffs above and below the main bulk of cells, removing cells that are both significantly smaller than average and significantly larger than average. You can plot the result by running `scprep.plot.plot_library_size` with `cutoff=(low, high)` or `percentile=(low, high)` where low and high are values or percentiles that you choose."
   ]
  },
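  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For intuition, library size is simply the sum of counts in each row, and a percentile cutoff keeps the cells that fall between two quantiles of that sum. The sketch below reproduces the idea with plain NumPy on a toy matrix. It is an illustration under assumptions, not the `scprep` implementation: the toy data are random, and whether cells exactly on a boundary are kept is a choice made here for simplicity.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy counts matrix: 500 cells (rows) x 100 genes (columns)\n",
    "rng = np.random.default_rng(42)\n",
    "toy_counts = rng.poisson(2.0, size=(500, 100))\n",
    "\n",
    "# Library size = total counts per cell\n",
    "library_size = toy_counts.sum(axis=1)\n",
    "\n",
    "# Convert percentiles into count thresholds, then keep cells strictly between them\n",
    "low, high = np.percentile(library_size, (20, 80))\n",
    "keep = (library_size > low) & (library_size < high)\n",
    "toy_filtered = toy_counts[keep]\n",
    "print(toy_counts.shape, '->', toy_filtered.shape)\n",
    "```\n",
    "\n",
    "Passing `cutoff=(low, high)` instead of `percentile=(low, high)` corresponds to skipping the `np.percentile` step and supplying the thresholds directly."
   ]
  },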
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# =============\n",
    "# Select appropriate percentiles (percentile=(low, high))\n",
    "# for filtering and plot the result\n",
    "scprep.plot.plot_library_size(data_time1, percentile=(20, 80))\n",
    "# ============="
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YG4_xhXt7RQb"
   },
   "source": [
    "#### Selecting a cutoff\n",
    "\n",
    "Several papers describe strategies for picking a maximum and minimum threshold that can be found with a quick Google search for \"library size threshold single cell RNA seq\".\n",
    "\n",
    "Most of these pick an arbitrary measure such as a certain number of deviations below or above the mean or median library size. We find that spending too much time worrying about the exact threshold is inefficient, as results tend to be robust to filtering beyond some minimum threshold.\n",
    "\n",
    "For the above dataset, we recommend removing all cells with more than 12,000 UMI / cell in fear they might represent doublets of cells. We generally also remove all cells with fewer than 1000 UMIs per cell.\n",
    "\n",
    "### Exercise 1 - Filtering cells by library size\n",
    "\n",
    "You can do this using [`scprep.filter.filter_library_size()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.filter.filter_library_size)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "o_ucsSK9acXO"
   },
   "outputs": [],
   "source": [
    "filtered_batches = []\n",
    "for batch in [data_time1, data_time2, data_time3, data_time4, data_time5]:\n",
    "    # ==================\n",
    "    # fill in your chosen `percentile` values\n",
    "    percentiles = (20, 80)\n",
    "    batch = scprep.filter.filter_library_size(batch, percentile=percentiles)\n",
    "    # ==================\n",
    "    filtered_batches.append(batch)\n",
    "del data_time1, data_time2, data_time3, data_time4, data_time5  # removes objects from memory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** If you are going to use `percentile` filter, you need to do the filtering individually for each sample.  If you are going to do filtering using hard numbers like _remove all cells with more than 12000 UMI/cell,_ you can do it for all the data together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 299
    },
    "colab_type": "code",
    "id": "w9GKIljH4S05",
    "outputId": "3fb00fa3-a298-41e3-a7c0-a0c8fe815592"
   },
   "outputs": [],
   "source": [
    "scprep.plot.plot_library_size(\n",
    "    filtered_batches[0], log=False, range=(0, 30000), title=\"Library size after filtering\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The library size distribution is now much more constrained, which will reduce the effects of differences in library size (which can affect your results, even after normalization)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "JQoJnFVxacXW"
   },
   "source": [
    "### 2.4. Merge all datasets and create a vector representing the time point of each sample\n",
    "\n",
    "Now that we have filtered the datasets by library size, we can combine them into a single matrix. You can do this with [`scprep.utils.combine_batches()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.utils.combine_batches)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 345
    },
    "colab_type": "code",
    "id": "FJdO1NcKacXX",
    "outputId": "25110336-1f75-446a-a4d8-aca03196694c"
   },
   "outputs": [],
   "source": [
    "data, sample_labels = scprep.utils.combine_batches(\n",
    "    filtered_batches, [\"Day 00-03\", \"Day 06-09\", \"Day 12-15\", \"Day 18-21\", \"Day 24-27\"]\n",
    ")\n",
    "del filtered_batches  # removes objects from memory\n",
    "data.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After combining batches, we have ~25,000 cells and 33,700 genes. Let's have a quick look at the combined matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 345
    },
    "colab_type": "code",
    "id": "FJdO1NcKacXX",
    "outputId": "25110336-1f75-446a-a4d8-aca03196694c"
   },
   "outputs": [],
   "source": [
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we have one matrix containing all five time points of the experiment, and a vector (`sample_labels`) telling us which cells (rows) came from which time point. As you can see, the time point is also appended to the cell barcode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sample_labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Finally, because the Google Colab client is not the most powerful client, we'll subsample this dataset to allow you to run the preprocessing steps a little more quickly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# data, sample_labels = scprep.select.subsample(data, sample_labels, n=10000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.shape, sample_labels.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Batch effects\n",
    "\n",
    "It's worth noting here that we have combined five different experiments into one data matrix here, and in doing so potentially exposed ourself to a \"batch effect\". A batch effect is a difference (technical or biological) between two batches of an experiment. This can be caused by systematic error (e.g. a difference in temperature during library preparation) or a genuine biological effect of interest (e.g. in this case each batch is sampled at a different time of development.)\n",
    "\n",
    "Correcting for technical / systematic batch effects is a topic unto itself which we will cover later in the workshop; however it's worth noting that any time we combine multiple experiments, we should check to see if they are substantially different and compare this to our expectations. If the effect is large and unexpected, we can either a) attempt to correct it / account for it using computational methods, or b) modify our analysis to avoid combining the datasets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "\n",
    "1. Why do we remove cells with low UMI count (or library size)?\n",
    "\n",
    "Cells with low library size are most likely empty droplets.\n",
    "\n",
    "2. Why do we remove cells with high UMI count (or library size)?\n",
    "\n",
    "Cells with very high library size have a higher probability of containing two cells (i.e., a doublet)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "pGrC9YiEacXc"
   },
   "source": [
    "<a id='preprocessing'></a>\n",
    "## 3. Preprocessing: Filtering, Normalizing, and Transforming\n",
    "\n",
    "All of these steps are carried out on the whole combined data.\n",
    "\n",
    "### Filtering\n",
    "\n",
    "We filter the data by: \n",
    "1. Removing dead cells  \n",
    "2. Filtering by library size (if we did not do this prior to combining batches)\n",
    "3. Removing genes that are expressed in relatively few cells."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3e3m5x8LacXd"
   },
   "source": [
    "### 3.1 Dead cell removal\n",
    "\n",
    "#### What does high mitochondrial gene expression indicate?\n",
    "\n",
    "Generally, we assume that cells with high detection of mitochondrial RNAs have undergone degradation of the mitochondrial membrane as a result of apoptosis. This may be from stress during dissociation, culture, or really anywhere in the experimental pipeline. As with the high and low library size cells, we want to remove the long tail from the distribution. In a successful experiment, it's typical for 5-10% of the cells to have this apoptotic signature.\n",
    "\n",
    "#### Plotting mitochondrial expression\n",
    "\n",
    "Let's look at the mitochondrial expression. You can do this using [`scprep.plot.plot_gene_set_expression()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.plot_gene_set_expression), which conveniently gets you the list of mitochrondrial genes by name using the helper function [`scprep.select.get_gene_set()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.select.get_gene_set)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 362
    },
    "colab_type": "code",
    "id": "1ZKa468GacXe",
    "outputId": "5b8c2496-bdd8-4221-8ae2-d3ba07d7dadd",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Plot all mitochondrial genes. There are 14, FYI.\n",
    "scprep.plot.plot_gene_set_expression(\n",
    "    data,\n",
    "    starts_with=\"MT-\",\n",
    "    library_size_normalize=True,\n",
    "    title=\"Mitochrondrial expression before filtering\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3PBCeKuPacXh"
   },
   "source": [
    "Here we see that above some threshold, there is a steep increase in expression of mitochondrial RNAs. We'll remove these cells from further analysis. Choose a cut-off based on the histogram above and plot your chosen value on a new histogram.\n",
    "\n",
    "### Exercise 2 - filtering dead cells by mitochondrial expression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ================\n",
    "# choose a cutoff at which to plot a red line such that you\n",
    "# remove cells with aberrant mitochondrial expression\n",
    "scprep.plot.plot_gene_set_expression(\n",
    "    data, starts_with=\"MT-\", library_size_normalize=True, cutoff=400\n",
    ")\n",
    "# ================"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can remove those cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "IYhfs7y9acXi"
   },
   "outputs": [],
   "source": [
    "# ================\n",
    "# Fill in your chosen cutoff value\n",
    "cutoff = 400\n",
    "# ================\n",
    "data_filt, sample_labels = scprep.filter.filter_gene_set_expression(\n",
    "    data,\n",
    "    sample_labels,\n",
    "    starts_with=\"MT-\",\n",
    "    cutoff=cutoff,\n",
    "    keep_cells=\"below\",\n",
    "    library_size_normalize=True,\n",
    ")\n",
    "data_filt.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After removing the top 10\\% of cells by mitochrondrial expression, we're left with 9,000 cells. Let's look at that histogram again."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.plot_gene_set_expression(\n",
    "    data_filt,\n",
    "    starts_with=\"MT-\",\n",
    "    library_size_normalize=True,\n",
    "    title=\"Mitochrondrial expression after filtering\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Much better! We have a tight distribution without a long tail, indicating that all of our cells have relatively normal mitochondrial expression."
   ]
  },
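  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the filtered quantity concrete, the sketch below computes library-size-normalized mitochondrial expression by hand on a toy table and applies the same cutoff of 400 used above. Several things here are assumptions for illustration: the counts are made up, the `ENSG_*` identifiers are placeholders rather than real Ensembl IDs, and the rescaling of every cell to 10,000 counts is assumed rather than taken from the `scprep` source.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy cells-by-genes counts with `SYMB (ENSID)` style column names\n",
    "toy = pd.DataFrame(\n",
    "    {\n",
    "        'MT-ND1 (ENSG_A)': [1, 30, 2],\n",
    "        'MT-CO1 (ENSG_B)': [2, 50, 1],\n",
    "        'ACTB (ENSG_C)': [47, 10, 27],\n",
    "        'GAPDH (ENSG_D)': [50, 10, 70],\n",
    "    },\n",
    "    index=['cell_1', 'cell_2', 'cell_3'],\n",
    ")\n",
    "\n",
    "# Select mitochondrial genes by their symbol prefix\n",
    "mito_cols = [c for c in toy.columns if c.startswith('MT-')]\n",
    "\n",
    "# Mitochondrial counts per cell, rescaled as if every cell had 10,000 counts\n",
    "library_size = toy.sum(axis=1)\n",
    "mito_norm = toy[mito_cols].sum(axis=1) / library_size * 10000\n",
    "\n",
    "# Keep cells below the cutoff; cell_2 is dominated by mitochondrial counts\n",
    "keep = mito_norm < 400\n",
    "print(toy[keep].index.tolist())\n",
    "```\n",
    "\n",
    "Normalizing by library size matters here: without it, a cell with many total counts could look like it has high mitochondrial expression simply because it has more counts overall."
   ]
  },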
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "ZxLVx9q1acXo"
   },
   "source": [
    "### 3.2 - Filtering lowly expressed genes\n",
    "\n",
    "#### Why remove lowly expressed genes?\n",
    "\n",
    "Capturing RNA from single cells is a noisy process. The first round of reverse transcription is done in the presence of cell lysate. This results in capture of only 10-40% of the mRNA molecules in a cell leading to a phenomenon called dropout where some lowly expressed genes are not detected in cells in which they are expressed [[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561556/#CR13), [2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561556/#CR44), [3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561556/#CR64), [4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5561556/#CR65)]. As a result, some genes are so lowly expressed (or expressed not at all) that we do not have sufficient observations of that gene to make any inferences on its expression.\n",
    "\n",
    "Lowly expressed genes that may only be represented by a handful of mRNAs may not appear in a given dataset. Others might only be present in a small number of cells. Because we lack sufficient information about these genes, we remove lowly expressed genes from the gene expression matrix during preprocessing. Typically, if a gene is detected in only very few cells, it gets removed.\n",
    "\n",
    "Here, we can see that in EB dataset, there are many genes that are detected in very few cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.histogram(\n",
    "    scprep.measure.gene_capture_count(data_filt),\n",
    "    log=True,\n",
    "    title=\"Gene capture before filtering\",\n",
    "    xlabel=\"# of cells with nonzero expression\",\n",
    "    ylabel=\"# of genes\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, we see a relatively clean distribution on the right (genes observed in many cells) with a heavy tail on the right (rare genes). Where would you choose to cut this off?\n",
    "\n",
    "### Exercise 3 - filtering rare genes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ================\n",
    "# choose a cutoff\n",
    "scprep.plot.histogram(\n",
    "    scprep.measure.gene_capture_count(data_filt),\n",
    "    cutoff=10,\n",
    "    log=True,\n",
    "    title=\"Gene capture before filtering\",\n",
    "    xlabel=\"# of cells with nonzero expression\",\n",
    "    ylabel=\"# of genes\",\n",
    ")\n",
    "# ================"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's go ahead and remove those genes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "9pyvgVwracXq"
   },
   "outputs": [],
   "source": [
    "# ================\n",
    "# choose a cutoff\n",
    "cutoff = 10\n",
    "data_filt = scprep.filter.filter_rare_genes(data_filt, min_cells=cutoff)\n",
    "# ================"
   ]
  },
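  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the filtering step concrete, here is a minimal sketch of the same idea in plain `pandas`. It is not `scprep`'s implementation; the toy matrix `toy` and the threshold `min_cells = 2` are made up for illustration. We count, for each gene, the number of cells with nonzero expression and keep only the genes detected in at least `min_cells` cells.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy counts matrix (hypothetical): rows are cells, columns are genes\n",
    "toy = pd.DataFrame(\n",
    "    [[0, 3, 1], [0, 2, 0], [1, 5, 0], [0, 1, 0]],\n",
    "    columns=['gene_a', 'gene_b', 'gene_c'],\n",
    ")\n",
    "\n",
    "# Number of cells in which each gene has nonzero expression\n",
    "cells_per_gene = (toy > 0).sum(axis=0)\n",
    "\n",
    "# Keep genes detected in at least `min_cells` cells\n",
    "min_cells = 2\n",
    "toy_filt = toy.loc[:, cells_per_gene >= min_cells]\n",
    "print(list(toy_filt.columns))\n",
    "```\n",
    "\n",
    "Only the gene seen in enough cells survives; the cells themselves are untouched, so the number of rows does not change."
   ]
  },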
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can plot the above histogram again for good measure. As you can see, the rare genes are all gone."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.histogram(\n",
    "    scprep.measure.gene_capture_count(data_filt),\n",
    "    cutoff=cutoff,\n",
    "    log=True,\n",
    "    title=\"Gene capture after filtering\",\n",
    "    xlabel=\"# of cells with nonzero expression\",\n",
    "    ylabel=\"# of genes\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_filt.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After filtering, we have removed many thousands of genes. This will be our last filtering step, though you can always filter out aberrant expression on an ad hoc basis. As it stands, we've quite significantly reduced our dataset from 10,000 x 33,000 that we started with. And that's not even counting the cells we removed by library size before combining time points!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zTOt3GU2acXu"
   },
   "source": [
    "### 3.3 - Normalization\n",
    "\n",
    "As you saw during filtering, the range of library sizes between cells can be quite extreme. We visualized this for one time point pre-filtering, but let's now visualize the whole dataset, post-filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.plot_library_size(data_filt, title=\"Library size before normalization\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "zTOt3GU2acXu"
   },
   "source": [
    "To correct for differences in library sizes, we divide each cell by its library size and then rescale by the a fixed value, sometimes the median library size. The default in `scprep` is to rescale every cell to 10,000 counts to make numbers comparable across datasets.\n",
    "\n",
    "In python this is performed using the preprocessing method [`scprep.normalize.library_size_normalize()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.normalize.library_size_normalize)."
   ]
  },
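  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The arithmetic behind library size normalization is short enough to write out by hand. The sketch below uses plain NumPy on a made-up matrix `counts` (not the EB data) and assumes every cell has at least one count, so that no library size is zero. It illustrates the idea rather than reproducing `scprep`'s code.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Toy counts matrix (hypothetical): rows are cells, columns are genes\n",
    "counts = np.array([[1.0, 3.0, 6.0], [10.0, 30.0, 60.0]])\n",
    "\n",
    "# Library size: total counts observed in each cell\n",
    "library_size = counts.sum(axis=1)\n",
    "\n",
    "# Divide each cell by its library size, then rescale to a fixed total\n",
    "rescale = 10000\n",
    "normalized = counts / library_size[:, None] * rescale\n",
    "\n",
    "print(normalized.sum(axis=1))\n",
    "```\n",
    "\n",
    "The two toy cells have the same relative expression but a tenfold difference in sequencing depth. After normalization their rows are identical, which is exactly the effect we want."
   ]
  },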
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.normalize.library_size_normalize"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating the `metadata` DataFrame\n",
    "\n",
    "You'll notice now we have two metadata objects: `sample_labels` and `library_size`. We can make our lives easier by putting these into a DataFrame as follows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "metadata = pd.concat([sample_labels], axis=1)\n",
    "metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qtpLNUB2acX4"
   },
   "source": [
    "### 3.4 - Gene Count Transformation\n",
    "\n",
    "In scRNA-seq analysis, we often see that some genes are orders of magnitude more common than others. Let's have a look at the mean expression of each gene."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.histogram(\n",
    "    data_norm.mean(axis=0),\n",
    "    log=\"y\",\n",
    "    title=\"Gene counts before transformation\",\n",
    "    xlabel=\"total # of gene counts\",\n",
    "    ylabel=\"# of genes\",\n",
    ")"
   ]
  },
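  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A common way to tame this range is to apply a square root or log transform to the normalized counts. The sketch below shows the effect on a few made-up values; it assumes a square root transform, since the cells that follow refer to a `data_sqrt` object. `scprep` offers `scprep.transform.sqrt` for applying this to a full matrix.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Made-up normalized counts spanning several orders of magnitude\n",
    "values = np.array([0.0, 1.0, 100.0, 10000.0])\n",
    "\n",
    "# Square root transform: compresses the range and is defined at zero\n",
    "values_sqrt = np.sqrt(values)\n",
    "\n",
    "# log(1 + x) is a common alternative that is also defined at zero\n",
    "values_log = np.log1p(values)\n",
    "\n",
    "print(values_sqrt)\n",
    "print(values_log)\n",
    "```\n",
    "\n",
    "Both transforms leave zeros as zeros, which keeps sparse matrices sparse. The square root avoids the pseudocount that a plain log would need."
   ]
  },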
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5 - Selecting highly variable genes\n",
    "\n",
    "Many workflows involve just selecting highly variable genes. Gene variability is defined as the variance relative to the expected variance for a given gene mean, or \"standardized variance\". That is, genes with high mean have naturally higher variance simply due to their magnitude; we are interested in the genes that vary more than one would expect by random chance.\n",
    "\n",
    "You can plot the standardized variance with [`scprep.plot.plot_gene_variability()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.plot.plot_gene_variability)."
   ]
  },
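  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To build some intuition for \"variance relative to the expected variance\", here is a simplified sketch on simulated data. It is a stand-in, not `scprep`'s method: we assume the expected variance can be approximated by a straight-line fit of log-variance against log-mean, and all names (`data`, `variability`, `is_hvg`) are made up for this example.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "# Simulated data: 200 cells x 50 genes of Poisson counts with varied means\n",
    "means = rng.uniform(0.5, 20, size=50)\n",
    "data = rng.poisson(means, size=(200, 50)).astype(float)\n",
    "# Make the first gene far more variable than its mean would suggest\n",
    "data[:, 0] *= rng.choice([0.0, 4.0], size=200)\n",
    "\n",
    "gene_mean = data.mean(axis=0)\n",
    "gene_var = data.var(axis=0)\n",
    "\n",
    "# Expected variance for a given mean: a linear fit in log-log space\n",
    "# (a simple stand-in for the smoother a real toolkit would use)\n",
    "slope, intercept = np.polyfit(np.log(gene_mean), np.log(gene_var), deg=1)\n",
    "expected_log_var = slope * np.log(gene_mean) + intercept\n",
    "\n",
    "# Excess variability: observed minus expected log-variance\n",
    "variability = np.log(gene_var) - expected_log_var\n",
    "\n",
    "# Keep the genes above the 90th percentile of variability\n",
    "cutoff = np.percentile(variability, 90)\n",
    "is_hvg = variability > cutoff\n",
    "print(is_hvg.sum(), is_hvg[0])\n",
    "```\n",
    "\n",
    "The gene we perturbed stands out because its variance is large for its mean, not because its mean is large."
   ]
  },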
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scprep.plot.plot_gene_variability(data_sqrt, percentile=90)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you wish to select just the highly variable genes, you can do so with [`scprep.select.highly_variable_genes()`](https://scprep.readthedocs.io/en/stable/reference.html#scprep.select.highly_variable_genes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_hvg = scprep.select.highly_variable_genes(data_sqrt, percentile=90)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_hvg.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unsurprisingly, we're now left with just 1,800 genes. This reduced size dataset can be useful for, for example, dimensionality reduction and clustering, which we will cover later in the workshop.\n",
    "\n",
    "Finally, let's save the preprocessed data (we keeping all the genes, not just the highly variable ones) for later use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_sqrt.to_pickle(\"embryoid_body_data.pickle.gz\")\n",
    "metadata.to_pickle(\"embryoid_body_metadata.pickle.gz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scanpy as sc"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_raw = sc.AnnData(X=data_filt, obs=metadata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.write_h5ad(download_path + \"ebdata_v1.h5ad\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.normalize_total(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.X = np.sqrt(adata.X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.external.tl.phate(adata)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.write_h5ad(download_path + \"ebdata_normalized.h5ad\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.load(download_path + \"eb_velocity_v5.npz\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_raw.obsm[\"X_phate\"] = adata.obsm[\"X_phate\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.calculate_qc_metrics(adata_raw, inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.normalize_total(adata_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.log1p(adata_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pp.neighbors(adata_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.umap(adata_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.tsne(adata_raw)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_raw.write_h5ad(download_path + \"ebdata.h5ad\", compression=\"gzip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.scatter(adata, basis=\"phate\", color=\"sample_labels\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.scatter(adata_raw, basis=\"umap\", color=\"sample_labels\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.scatter(adata_raw, basis=\"tsne\", color=\"sample_labels\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = adata_raw"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.scatter(adata, basis=\"phate\", color=\"total_counts\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.tl.leiden(adata, resolution=0.5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "del adata.uns[\"leiden_colors\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sc.pl.scatter(adata, basis=\"phate\", color=\"leiden\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var[\"gene_name\"] = [s.split()[0] for s in adata.var.index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var[\"gene_id\"] = [s.split()[1] for s in adata.var.index]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var[\"gene_name_id\"] = adata.var.index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.write_h5ad(download_path + \"ebdata.h5ad\", compression=\"gzip\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.var.gene_name[adata.var.gene_name.str.startswith(\"POU5\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gene_map = {\n",
    "    \"ESC\": [\"NANOG\", \"POU5F1\"],\n",
    "    \"PS\": [\"EOMES\", \"MIXL1\", \"CER1\", \"SATB1\", \"T\"],\n",
    "    \"Pre-NE\": [\"POU5F1\", \"OTX2\"],\n",
    "    \"ME\": [\"T\"],\n",
    "    \"EN\": [\"ARID3A\"],\n",
    "    \"NE-2\": [\"GBX2\", \"OLIG3\", \"HOXD1\"],\n",
    "    \"NE-1\": [\"GBX2\", \"ZIC2\", \"ZIC5\"],\n",
    "    \"Lateral Plate ME\": [\"TBX5\", \"HOXD9\", \"MYC\"],\n",
    "    \"Hemangioblast\": [\"TAL1\", \"HOXB4\", \"SOX17\", \"CD34\", \"PECAM1\"],\n",
    "    \"Cardiac-2\": [\"NKX2-5\", \"LEF1\", \"MYC\"],\n",
    "    \"EN-2\": [\"GATA3\", \"SATB1\", \"SOX15\"],\n",
    "    \"EN-1\": [\"FOXA2\", \"SOX17\"],\n",
    "    \"NE/NC\": [\"HOXA2\", \"HOXB1\"],\n",
    "    \"NE-1 2\": [\"ZIC2\", \"ZIC5\", \"GLI3\", \"LHX2\", \"LHX5\", \"PAX6\", \"SIX3\", \"SIX6\"],\n",
    "    \"Epicardial Precursors\": [\"WT1\"],\n",
    "    \"Smooth Muschle Precursors\": [\"TBX18\", \"SIX2\", \"TBX15\", \"PDGFRA\"],\n",
    "    \"Cardiac Precursors-1\": [\"GATA5\"],\n",
    "    \"Cardiac Precursors-2\": [\"GATA6\"],\n",
    "    \"Cardiac Precursors-3\": [\"TNNT2\"],\n",
    "    \"Posterior EN\": [\"CDX2\", \"ASCL2\", \"KLF5\", \"NKX2-1\"],\n",
    "    \"NC\": [\"PAX3\", \"FOXD3\", \"SOX9\", \"SOX10\"],\n",
    "    \"NP\": [\"NES\", \"MAP2\"],\n",
    "    \"Neuronal Subtypes-1\": [\"KLF7\", \"ISL1\", \"DLX1\", \"ONECUT1\", \"ONECUT2\"],\n",
    "    \"Neuronal Subtypes-2\": [\"OLIG1\", \"NPAS1\", \"LHX2\", \"NR2F1\"],\n",
    "    \"Neuronal Subtypes-3\": [\"OLIG1\", \"NPAS1\", \"DMRT3\", \"LMX1A\"],\n",
    "    \"Neuronal Subtypes-4\": [\"NKX2-8\", \"EN2\", \"SOX1\"],\n",
    "    \"Neuronal Subtypes-5\": [\"PAX6\", \"ZBTB16\"],\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.uns[\"gene_map\"] = gene_map"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata.uns[\"gene_map\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "full_gene_list = []\n",
    "for k, v in gene_map.items():\n",
    "    full_gene_list.extend(v)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sub = adata[:, adata.var.gene_name.isin(full_gene_list)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adata_sub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}