{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "45606a88-db2d-40d0-8a07-3f3f5dd14b00",
   "metadata": {},
   "source": [
    "# Intro\n",
    "\n",
    "This notebook downloads the Epithelial cell dataset used for experiments in \"Feature Selection in the Contrastive Analysis Setting\". The notebook will download any necessary files."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "912e4d6b-38d9-432f-b6cd-8ed0c21241ae",
   "metadata": {
    "tags": []
   },
   "source": [
    "### First, we download the data from the NIH Gene Expression Omnibus and write it to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2483dc0b-e57d-4fe5-81a7-43495a8a5599",
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "print('Downloading compressed data')\n",
    "\n",
    "url = 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE92nnn/GSE92332/suppl/GSE92332%5FSalmHelm%5FUMIcounts%2Etxt%2Egz'\n",
    "r = requests.get(url)\n",
    "compressed_file_name = './GSE92332_SalmHelm_UMIcounts.txt.gz'\n",
    "\n",
    "with open(compressed_file_name, 'wb') as f:\n",
    "    f.write(r.content)\n",
    "    \n",
    "print(\"Data successfully written to disk\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6c9aff1-9712-4824-9da3-7375393432d5",
   "metadata": {},
   "source": [
    "### Next, we read in the file and begin preprocessing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8af191d3-40ce-4f90-9a7e-6a8ecaf7b668",
   "metadata": {},
   "outputs": [],
   "source": [
    "import gzip\n",
    "import pandas as pd\n",
    "\n",
    "with gzip.open(compressed_file_name, 'rb') as f:\n",
    "    df = pd.read_csv(f, sep='\\t')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "524c254d-f649-4fcb-a8cc-5c1b7895f2e0",
   "metadata": {},
   "source": [
    "The data was originally stored with each gene being a row and each cell being a column. We'll transpose our data matrix to have a more standard arrangement of rows being samples and features being columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8de8c135-bf29-439e-a16d-79e1c7dc2053",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = df.transpose()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9c19dd2-cac2-41d7-ae49-e50c7674ab6a",
   "metadata": {},
   "source": [
    "Next, we'll extract metadata from each cell's name. First, we'll display a sample of cell names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1baaa147-5b58-4aac-9c6d-289d771f2b49",
   "metadata": {},
   "outputs": [],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "343a42d2-f0aa-4852-b4c4-2091bf7b5272",
   "metadata": {},
   "source": [
    "We can see that these names contain a number of useful pieces of metadata (e.g. cell type) separated by the `_` character. We'll extract that metadata here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83d4117d-bb8b-49b9-ae45-5cbf1aceadf1",
   "metadata": {},
   "outputs": [],
   "source": [
    "cell_groups = []\n",
    "barcodes = []\n",
    "conditions = []\n",
    "cell_types = []\n",
    "\n",
    "for cell in df.index:\n",
    "    cell_group, barcode, condition, cell_type = cell.split('_')\n",
    "    cell_groups.append(cell_group)\n",
    "    barcodes.append(barcode)\n",
    "    conditions.append(condition)\n",
    "    cell_types.append(cell_type)\n",
    "    \n",
    "metadata_df = pd.DataFrame({'cell_group': cell_groups, 'barcode': barcodes, 'condition': conditions, 'cell_type': cell_types})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f75db0f-8496-4ae9-bcf8-eb72a331e70d",
   "metadata": {},
   "source": [
    "Finally, we'll perform some standard preprocessing steps on our scRNA-seq data. First, we'll normalize the data so that count numbers are comparable across cells and log-transform the resulting normalized counts. To do so, we'll use functions from `scanpy`, a popular Python library for handling scRNA-seq data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3eee9df7-6c95-4f30-8708-bd6b050c6f98",
   "metadata": {},
   "outputs": [],
   "source": [
    "import scanpy as sc\n",
    "from anndata import AnnData\n",
    "\n",
    "adata = AnnData(X = df.values, obs=metadata_df) # The annotated dataframe (AnnData) is a wrapper class used by scanpy for most of its functions.\n",
    "sc.pp.normalize_total(adata, 1e4)\n",
    "sc.pp.log1p(adata)\n",
    "\n",
    "np.save('data/epithel_new/Control.npy', adata[adata.obs['conditions'] == 'Control'].X.copy())\n",
    "np.save('data/epithel_new/Hpoly.npy', adata[adata.obs['conditions'] == 'Hpoly.Day10'].X.copy())\n",
    "np.save('data/epithel_new/Salmonella.npy', adata[adata.obs['conditions'] == 'Salmonella'].X.copy())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a84f90c-3949-4e81-9599-6cc055707044",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
