{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eafeca45",
   "metadata": {},
   "source": [
    "## Preprocess the gene expression RNA-seq (transcriptomic) dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "b471d75e-6845-4e01-89c0-b4a6a9a1680a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af012ce2",
   "metadata": {},
   "source": [
    "We first need the path to our folder containing case-organized data and the destination for storing the processed transcriptomic data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bb35be5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "ORGANIZED_BY_CASE_PATH = \".../TCGA/data_by_cases\"\n",
    "DESTINATION_DATA_PATH = \".../TCGA/data_processed/PRCSD_transcriptomic_data.csv\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a069f05f",
   "metadata": {},
   "source": [
    "We use the following function to read in RNA-seq data. This function should be adapted to the format of gene expression data used for a project. We isolate RNA-seq data derived in \"fragments per kilobase of exon per million mapped fragments\" (FPKM). Only protein-coding genes are included for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "13d68835",
   "metadata": {},
   "outputs": [],
   "source": [
    "def read_gene_expression(filepath, case_id):\n",
    "    arr = []\n",
    "    with open(filepath) as f:\n",
    "        lines = f.readlines()\n",
    "        for l in lines:\n",
    "            arr.append(l.upper().split())\n",
    "    matrix = pd.DataFrame(arr)[1:]\n",
    "    matrix.columns = matrix.iloc[0]\n",
    "    matrix = matrix[matrix[\"GENE_TYPE\"] == \"PROTEIN_CODING\"]\n",
    "    matrix = matrix[['GENE_ID', 'FPKM_UNSTRANDED']].set_index('GENE_ID').transpose()\n",
    "    return matrix.rename(columns={'GENE_ID': 'CASE_ID'},index={'FPKM_UNSTRANDED': case_id}).reset_index().rename(columns={1:'CASE_ID'})\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7fcc2851",
   "metadata": {},
   "outputs": [],
   "source": [
    "def preprocess_trans(by_case_path): \n",
    "    cases = os.listdir(by_case_path)\n",
    "    gene_exp_data = []\n",
    "    \"\"\"\n",
    "    Loop through every case filepath and search for transcriptomic data. \n",
    "    Apply the read CSV function to each transcriptomic data found. \n",
    "    After all the transcriptomic files are read, we can concatenate them to create a matrix where rows are cases, columns are genes, and values are the respective expression values.\n",
    "    \"\"\"\n",
    "    for case in cases:\n",
    "        contents_gene_exp = os.listdir(os.path.join(by_case_path, case, \"gene_expression\"))\n",
    "        if len(contents_gene_exp) == 0:\n",
    "            print(f\"{case} has no gene expression data\")\n",
    "        else:\n",
    "            filename = contents_gene_exp[0]\n",
    "            path = os.path.join(by_case_path, case, \"gene_expression\", filename)\n",
    "            gene_exp_data.append(read_gene_expression(path, case))   \n",
    "    all_gene_exp = pd.concat(gene_exp_data, axis = 0)\n",
    "    cols = list(set(all_gene_exp.columns) - set([\"CASE_ID\", \"GENE_ID\"]))\n",
    "    all_gene_exp[cols] = all_gene_exp[cols].astype(float)\n",
    "    all_gene_exp = all_gene_exp.drop(all_gene_exp.std()[all_gene_exp.std() == 0].index.values, axis=1)\n",
    "    all_gene_exp = all_gene_exp.rename(columns={\"CASE_ID\":\"case_id\"})\n",
    "    return all_gene_exp\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "968b3f3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "trans_data = preprocess_trans(ORGANIZED_BY_CASE_PATH)\n",
    "trans_data.to_csv(DESTINATION_DATA_PATH, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3209745f",
   "metadata": {},
   "source": [
    "----"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
