{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook for preprocessing/reducing MCMC file sizes\n",
    "- Steps to be followed after downloading the Goodwin/ Lotka-Volterra and Hinch data from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/MDKNWM\n",
    "- For Goodwin and Lotka experiments, the relevant files are titled theta.csv in the respective folders (theta denotes the 4 dimensional parameters), for each setting the csv file has 2M x 4 dimensional data (other files related to posterior, and gradient are not relevant for kernel thinning experiments)\n",
    "    - There is a single posterior setting for each of Goodwin and Lotka model \n",
    "    - For each posterior sampling, 4 random walks, RW, Ada-RW, MALA, precond-MALA, are simulated for 2M iterations (over 4 dimensional parameter space)\n",
    "    - We save each of the 8 csv files as pkl files\n",
    "- For Hinch experiments, the data files have 'theta' in their file name (but on July 28 are part of Posterior.zip and Temperedposterior.zip in Cardiac folder)\n",
    "    - There are two posteriors, posterior (Post) and tempered posterior (TP) (the temperature is 1 for Post, and 8 for TP)\n",
    "    - For both Post and TP, a single chain, RW is simulated for 4M iterations (over 38 dimensional parameter space)\n",
    "    - The csv files are 38 x 4M dimensional (making pre-processing a bit more involved since there are more columns than rows)\n",
    "    - We first save transposed data as another csv file and then save the transposed data (4M by 38) as pkl for fast access; transposing makes it tractable for pd.read_csv to work with chunksize\n",
    "    - We also save the Pnmax, and samples for this setting using burn-in of 1M samples as noted in Appendix S5.4 of the paper https://arxiv.org/abs/2005.03952 (v3)  (this is unlike Goodwin/LV where we simply saved the entire chain as pkl file)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from tqdm import tqdm\n",
    "import pickle as pkl\n",
    "from sys import getsizeof\n",
    "import numpy as np\n",
    "import csv\n",
    "# import dask.dataframe\n",
    "import os\n",
    "import os.path\n",
    "import time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = ! ls -LR | find . -name \\theta\\*.csv\n",
    "\n",
    "a = [\"./MCMC_DATA/Goodwin_ADA-RW.csv\"]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pathlib\n",
    "\n",
    "pathlib.Path(\"data\").mkdir(parents=True, exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Goodwin and Lotka Files\n",
    "\n",
    "- these files are small and can be dealth with easily directly in the memory\n",
    "- of size 2M by 4 (iterations x dimensions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "for name in a:\n",
    "    if \"Goodwin\" in name or \"Lotka\" in name:\n",
    "        pkl_name = name.replace(\"./MCMC_DATA\", \"./data\") # find pkl file in data folder\n",
    "        pkl_name = pkl_name.replace('.csv', '_theta.pkl')\n",
    "        if os.path.exists(pkl_name):\n",
    "            print(f'pkl file exists ({pkl_name})')\n",
    "        else:\n",
    "            print(f'loading {name} and dumping as pkl in {pkl_name}')\n",
    "            data = np.zeros((int(2e6), 4))\n",
    "            chunksize = 8000\n",
    "            idx = int(0)\n",
    "            for i, chunk in tqdm(enumerate(pd.read_csv(name, header=None, chunksize=chunksize))):\n",
    "                n = chunk.shape[0]\n",
    "                data[range(idx, idx+n), :] = chunk\n",
    "                idx += n\n",
    "            print(f'saving {name} as pkl file in {pkl_name}')\n",
    "            with open(pkl_name, 'wb') as file:\n",
    "                pkl.dump(data, file, protocol=pkl.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cardiac/Hinch Data Files\n",
    "\n",
    "- these files are huge, so need some better preprocessing\n",
    "- By default saved as dimensions x iterations (38 x 4M)\n",
    "- so we first save a transpose of it, and then convert it into a pkl file which are much faster to load"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "filenames = ['./MCMC_DATA/Posterior/seed_1/THETA_seed_1_temp_1.csv', './MCMC_DATA/Posterior/seed_2/THETA_seed_2_temp_1.csv',\n",
    "            './MCMC_DATA/Tempered_posterior/seed_1/THETA_seed_1_temp_8.csv', './MCMC_DATA/Tempered_posterior/seed_2/THETA_seed_2_temp_8.csv'\n",
    "           ]\n",
    "filenames_transposed = [f[:-4] + '_transposed.csv' for f in filenames]\n",
    "pkl_names = [g[:-3] + 'pkl' for g in filenames_transposed]\n",
    "for f, g, p in zip(filenames, filenames_transposed, pkl_names):\n",
    "    print(f, g, p)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transpose to create efficient loaders with pandas\n",
    "for f, g in zip(filenames, filenames_transposed):\n",
    "    s = time.time()\n",
    "    if os.path.exists(g):\n",
    "        print(f'tranpose {g} exists for {f}')\n",
    "        pass\n",
    "    else:\n",
    "        print(f'transposing {f} file with original shape 38 x 4M')\n",
    "        a = zip(*csv.reader(open(f, \"rt\")))\n",
    "        csv.writer(open(g, \"wt\")).writerows(a)\n",
    "        print(f'This loop took {time.time()-s} seconds') # takes around 3 minutes on mac m1 first gen with 16GB ram"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for g, p in zip(filenames_transposed, pkl_names):\n",
    "    if os.path.exists(p):\n",
    "        print(f'Transposed pkl exists ({p})')\n",
    "    else:\n",
    "        print(f'loading {g}')\n",
    "        data = np.zeros((int(4e6), 38))\n",
    "        chunksize = 8000\n",
    "        idx = int(0)\n",
    "        for i, chunk in tqdm(enumerate(pd.read_csv(g, header=None, chunksize=chunksize))):\n",
    "            n = chunk.shape[0]\n",
    "            data[range(idx, idx+n), :] = chunk\n",
    "            idx += n\n",
    "        print(f'saving {g} as pkl file in {p}')\n",
    "        with open(p, 'wb') as file:\n",
    "            pkl.dump(data, file, protocol=pkl.HIGHEST_PROTOCOL)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Further processing for Hinch data\n",
    "\n",
    "- while loading the pkl file in memory is no longer an issue, it is huge in size so transferring it to cluster would take forever\n",
    "- we preprocess Pnmax of size 2^15 and samples of sizes 4^m for m in {0, 1, ..., 7} by standard thinning from the end after burn-in of 1M samples"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving with normalizing (Hinch settings)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pkl_names = ['./MCMC_DATA/Posterior/seed_1/THETA_seed_1_temp_1_transposed.pkl', \n",
    "            './MCMC_DATA/Posterior/seed_2/THETA_seed_2_temp_1_transposed.pkl',\n",
    "            './MCMC_DATA/Tempered_posterior/seed_1/THETA_seed_1_temp_8_transposed.pkl',\n",
    "             './MCMC_DATA/Tempered_posterior/seed_2/THETA_seed_2_temp_8_transposed.pkl'\n",
    "            ]\n",
    "\n",
    "prefixes = [\"data/\" + f for f in ['Hinch_P_seed_1_temp_1', 'Hinch_P_seed_2_temp_1', 'Hinch_TP_seed_1_temp_8', 'Hinch_TP_seed_2_temp_8']]\n",
    "\n",
    "for p, prefix in zip(pkl_names, prefixes):\n",
    "    \n",
    "    burn_in = int(1e6)\n",
    "    print(f'\\n loading {p}')\n",
    "    with open(p, 'rb') as file:\n",
    "        X = pkl.load(file)\n",
    "    X = X[burn_in:]\n",
    "    \n",
    "    # separate in odd/even indices\n",
    "    idx_even = np.arange(X.shape[0]-1, 1, -2)[::-1]\n",
    "    idx_odd = np.arange(X.shape[0]-2, 0, -2)[::-1]\n",
    "    assert(len(set(idx_even).intersection(set(idx_odd)))==0)\n",
    "\n",
    "    # compute pnmax\n",
    "    Xpnmax = X[idx_odd]\n",
    "    end = Xpnmax.shape[0]\n",
    "    nmax = int(2**15)\n",
    "    step_size = int(end / nmax)\n",
    "    assert(step_size>=1)\n",
    "    # compute Pnmax by standard thinning from end\n",
    "    idx_Pnmax = np.arange(end-1, 0, -step_size)[:nmax][::-1] # standard thin from the end\n",
    "    filename = prefix + \"_pnmax_15.pkl\"\n",
    "    print(f'saving pnmax of size {Xpnmax[idx_Pnmax].shape} to {filename}')\n",
    "    with open(filename, \"wb\") as file:\n",
    "        pkl.dump(Xpnmax[idx_Pnmax], file, protocol=pkl.HIGHEST_PROTOCOL)\n",
    "        \n",
    "    # compute samples\n",
    "    X = X[idx_even]\n",
    "    \n",
    "    for m in range(8, 10):\n",
    "        n = int(4**m)\n",
    "        end = X.shape[0]\n",
    "        # compute thinning parameter\n",
    "        step_size = int(end / n)\n",
    "        start = end-step_size*n\n",
    "        assert(step_size>=1)\n",
    "        assert(start>=0)\n",
    "        samples = X[end-1:start:-step_size][::-1]\n",
    "        filename = prefix + f\"_samples_n_{n}.pkl\"\n",
    "        print(f'saving samples of size {samples.shape} to {filename}')\n",
    "        with open(filename, \"wb\") as file:\n",
    "            pkl.dump(samples, file, protocol=pkl.HIGHEST_PROTOCOL)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving after normalizing (Hinch Scaled)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pkl_names = ['./MCMC_DATA/Posterior/seed_1/THETA_seed_1_temp_1_transposed.pkl', \n",
    "            './MCMC_DATA/Posterior/seed_2/THETA_seed_2_temp_1_transposed.pkl',\n",
    "            './MCMC_DATA/Tempered_posterior/seed_1/THETA_seed_1_temp_8_transposed.pkl',\n",
    "             './MCMC_DATA/Tempered_posterior/seed_2/THETA_seed_2_temp_8_transposed.pkl'\n",
    "            ]\n",
    "\n",
    "prefixes_scaled = [f + '_scaled' for f in prefixes]\n",
    "\n",
    "for p, prefix in zip(pkl_names, prefixes_scaled):\n",
    "    \n",
    "    burn_in = int(1e6)\n",
    "    print(f'\\n loading {p}')\n",
    "    with open(p, 'rb') as file:\n",
    "        X = pkl.load(file)\n",
    "    \n",
    "    # standardize the data\n",
    "    \n",
    "    X = X[burn_in:]\n",
    "    scaler = StandardScaler().fit(X)\n",
    "    X = scaler.transform(X) # center and scale the data    \n",
    "    # separate in odd/even indices\n",
    "    idx_even =  np.arange(X.shape[0]-1, -1, -2)[::-1]\n",
    "    idx_odd = np.arange(X.shape[0]-2, -1, -2)[::-1]\n",
    "    assert(len(set(idx_even).intersection(set(idx_odd)))==0)\n",
    "\n",
    "    # compute pnmax\n",
    "    nmax = int(2**15)\n",
    "    idx_Pnmax = np.linspace(0, len(idx_odd)-1,  nmax, dtype=int, endpoint=True)\n",
    "    filename = prefix + \"_pnmax_15.pkl\"\n",
    "    print(f'saving pnmax of len {len(idx_Pnmax)} to {filename}')\n",
    "    with open(filename, \"wb\") as file:\n",
    "        pkl.dump(X[idx_odd][idx_Pnmax], file, protocol=pkl.HIGHEST_PROTOCOL)\n",
    "        \n",
    "    # compute samples\n",
    "    for m in range(0, 9):\n",
    "        n = int(4**m)\n",
    "        end = X.shape[0]\n",
    "        idx_samples = np.linspace(0, len(idx_even)-1,  int(n), dtype=int, endpoint=True)\n",
    "        filename = prefix + f\"_samples_n_{n}.pkl\"\n",
    "        print(f'saving samples of len {len(idx_samples)} to {filename}')\n",
    "        with open(filename, \"wb\") as file:\n",
    "            pkl.dump(X[idx_even][idx_samples], file, protocol=pkl.HIGHEST_PROTOCOL)\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Computing median parameters for the Hinch Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import pdist"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fnames = [ f + \"_samples_n_16384.pkl\" for f in prefixes]\n",
    "for f in fnames:\n",
    "    with open(f, \"rb\") as file:\n",
    "        X = pkl.load(file)\n",
    "    print(f, X.shape, np.nanmedian(pdist(X)).round(4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fnames = [ f + \"_samples_n_16384.pkl\" for f in prefixes_scaled]\n",
    "for f in fnames:\n",
    "    with open(f, \"rb\") as file:\n",
    "        X = pkl.load(file)\n",
    "    print(f, X.shape, np.nanmedian(pdist(X)).round(4))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
