{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conformalized quantile regression (CQR): Real data experiment\n",
    "\n",
    "In this tutorial we will load a real dataset and construct prediction intervals using CQR [1].\n",
    "\n",
    "[1] Yaniv Romano, Evan Patterson, and Emmanuel J. Candes, “Conformalized quantile regression.” 2019.\n",
    "\n",
    "## Prediction intervals\n",
    "\n",
    "Suppose we are given $ n $ training samples $ \\{(X_i, Y_i)\\}_{i=1}^n$ and we must now predict the unknown value of $Y_{n+1}$ at a test point $X_{n+1}$. We assume that all the samples $ \\{(X_i,Y_i)\\}_{i=1}^{n+1} $ are drawn exchangeably$-$for instance, they may be drawn i.i.d.$-$from an arbitrary joint distribution $P_{XY}$ over the feature vectors $ X\\in \\mathbb{R}^p $ and response variables $ Y\\in \\mathbb{R} $. We aim to construct a marginal distribution-free prediction interval $C(X_{n+1}) \\subseteq \\mathbb{R}$ that is likely to contain the unknown response $Y_{n+1} $. That is, given a desired miscoverage rate $ \\alpha $, we ask that\n",
    "$$ \\mathbb{P}\\{Y_{n+1} \\in C(X_{n+1})\\} \\geq 1-\\alpha $$\n",
    "for any joint distribution $ P_{XY} $ and any sample size $n$. The probability in this statement is marginal, being taken over all the samples $ \\{(X_i, Y_i)\\}_{i=1}^{n+1} $.\n",
    "\n",
    "To accomplish this, we build on the method of split conformal prediction. We first split the training data into two disjoint subsets, a proper training set and a calibration set. We fit two quantile regressors on the proper training set to obtain initial estimates of the lower and upper bounds of the prediction interval. Then, using the calibration set, we conformalize and, if necessary, correct this prediction interval. Unlike the original interval, the conformalized prediction interval is guaranteed to satisfy the coverage requirement regardless of the choice or accuracy of the quantile regression estimator.\n",
    "\n",
    "\n",
    "\n",
    "## A case study\n",
    "\n",
    "We start by importing several libraries, loading the real dataset and standardize its features and response. We set the target miscoverage rate $\\alpha$ to 0.1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Dataset: community\n",
      "Dimensions: train set (n=1595, p=100) ; test set (n=399, p=100)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/yu.bai/anaconda3/envs/cqr/lib/python3.7/site-packages/sklearn/utils/deprecation.py:58: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.\n",
      "  warnings.warn(msg, category=DeprecationWarning)\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "import random\n",
    "import numpy as np\n",
    "np.warnings.filterwarnings('ignore')\n",
    "\n",
    "from datasets import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "seed = 1\n",
    "\n",
    "random_state_train_test = seed\n",
    "random.seed(seed)\n",
    "np.random.seed(seed)\n",
    "torch.manual_seed(seed)\n",
    "if torch.cuda.is_available():\n",
    "    torch.cuda.manual_seed_all(seed)\n",
    "    \n",
    "# desired miscoverage error\n",
    "alpha = 0.1\n",
    "\n",
    "# desired quanitile levels\n",
    "quantiles = [0.05, 0.95]\n",
    "\n",
    "# used to determine the size of test set\n",
    "test_ratio = 0.2\n",
    "\n",
    "# name of dataset\n",
    "dataset_base_path = \"./datasets/\"\n",
    "dataset_name = \"community\"\n",
    "\n",
    "# load the dataset\n",
    "X, y = datasets.GetDataset(dataset_name, dataset_base_path)\n",
    "\n",
    "# divide the dataset into test and train based on the test_ratio parameter\n",
    "x_train, x_test, y_train, y_test = train_test_split(X,\n",
    "                                                    y,\n",
    "                                                    test_size=test_ratio,\n",
    "                                                    random_state=random_state_train_test)\n",
    "\n",
    "# reshape the data\n",
    "x_train = np.asarray(x_train)\n",
    "y_train = np.asarray(y_train)\n",
    "x_test = np.asarray(x_test)\n",
    "y_test = np.asarray(y_test)\n",
    "\n",
    "# compute input dimensions\n",
    "n_train = x_train.shape[0]\n",
    "in_shape = x_train.shape[1]\n",
    "\n",
    "# display basic information\n",
    "print(\"Dataset: %s\" % (dataset_name))\n",
    "print(\"Dimensions: train set (n=%d, p=%d) ; test set (n=%d, p=%d)\" % \n",
    "      (x_train.shape[0], x_train.shape[1], x_test.shape[0], x_test.shape[1]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data splitting\n",
    "\n",
    "We begin by splitting the data into a proper training set and a calibration set. Recall that the main idea is to fit a regression model on the proper training samples, then use the residuals on a held-out validation set to quantify the uncertainty in future predictions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# divide the data into proper training set and calibration set\n",
    "idx = np.random.permutation(n_train)\n",
    "n_half = int(np.floor(n_train/2))\n",
    "idx_train, idx_cal = idx[:n_half], idx[n_half:2*n_half]\n",
    "\n",
    "# zero mean and unit variance scaling \n",
    "scalerX = StandardScaler()\n",
    "scalerX = scalerX.fit(x_train[idx_train])\n",
    "\n",
    "# scale\n",
    "x_train = scalerX.transform(x_train)\n",
    "x_test = scalerX.transform(x_test)\n",
    "\n",
    "# scale the labels by dividing each by the mean absolute response\n",
    "mean_y_train = np.mean(np.abs(y_train[idx_train]))\n",
    "y_train = np.squeeze(y_train)/mean_y_train\n",
    "y_test = np.squeeze(y_test)/mean_y_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CQR random forests\n",
    "\n",
    "Given these two subsets, we now turn to conformalize the initial prediction interval constructed by quantile random forests [2]. Below, we set the hyper-parameters of the CQR random forests method.\n",
    "\n",
    "[2] Meinshausen Nicolai. \"Quantile regression forests.\" Journal of Machine Learning Research 7, no. Jun (2006): 983-999."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#########################################################\n",
    "# Quantile random forests parameters\n",
    "# (See QuantileForestRegressorAdapter class in helper.py)\n",
    "#########################################################\n",
    "\n",
    "# the number of trees in the forest\n",
    "n_estimators = 1000\n",
    "\n",
    "# the minimum number of samples required to be at a leaf node\n",
    "# (default skgarden's parameter)\n",
    "min_samples_leaf = 1\n",
    "\n",
    "# the number of features to consider when looking for the best split\n",
    "# (default skgarden's parameter)\n",
    "max_features = x_train.shape[1]\n",
    "\n",
    "# target quantile levels\n",
    "quantiles_forest = [quantiles[0]*100, quantiles[1]*100]\n",
    "\n",
    "# use cross-validation to tune the quantile levels?\n",
    "cv_qforest = True\n",
    "\n",
    "# when tuning the two QRF quantile levels one may\n",
    "# ask for a prediction band with smaller average coverage\n",
    "# to avoid too conservative estimation of the prediction band\n",
    "# This would be equal to coverage_factor*(quantiles[1] - quantiles[0])\n",
    "coverage_factor = 0.85\n",
    "\n",
    "# ratio of held-out data, used in cross-validation\n",
    "cv_test_ratio = 0.05\n",
    "\n",
    "# seed for splitting the data in cross-validation.\n",
    "# Also used as the seed in quantile random forests function\n",
    "cv_random_state = 1\n",
    "\n",
    "# determines the lowest and highest quantile level parameters.\n",
    "# This is used when tuning the quanitle levels by cross-validation.\n",
    "# The smallest value is equal to quantiles[0] - range_vals.\n",
    "# Similarly, the largest value is equal to quantiles[1] + range_vals.\n",
    "cv_range_vals = 30\n",
    "\n",
    "# sweep over a grid of length num_vals when tuning QRF's quantile parameters                   \n",
    "cv_num_vals = 10"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Symmetric nonconformity score \n",
    "\n",
    "In the following cell we run the entire CQR procudure. The class `QuantileForestRegressorAdapter` defines the underlying estimator. The class `RegressorNc` defines the CQR objecct, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` fits the regression function to the proper training set, corrects (if required) the initial estimate of the prediction interval using the calibration set, and returns the conformal band. Lastly, we compute the average coverage and length on future test data using `compute_coverage`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from cqr import helper\n",
    "from nonconformist.nc import RegressorNc\n",
    "from nonconformist.nc import QuantileRegErrFunc\n",
    "\n",
    "# define the QRF's parameters \n",
    "params_qforest = dict()\n",
    "params_qforest[\"n_estimators\"] = n_estimators\n",
    "params_qforest[\"min_samples_leaf\"] = min_samples_leaf\n",
    "params_qforest[\"max_features\"] = max_features\n",
    "params_qforest[\"CV\"] = cv_qforest\n",
    "params_qforest[\"coverage_factor\"] = coverage_factor\n",
    "params_qforest[\"test_ratio\"] = cv_test_ratio\n",
    "params_qforest[\"random_state\"] = cv_random_state\n",
    "params_qforest[\"range_vals\"] = cv_range_vals\n",
    "params_qforest[\"num_vals\"] = cv_num_vals\n",
    "\n",
    "# define QRF model\n",
    "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n",
    "                                                           fit_params=None,\n",
    "                                                           quantiles=quantiles_forest,\n",
    "                                                           params=params_qforest)\n",
    "        \n",
    "# define the CQR object\n",
    "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n",
    "\n",
    "# run CQR procedure\n",
    "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n",
    "\n",
    "# compute and print average coverage and average length\n",
    "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n",
    "                                                                 y_lower,\n",
    "                                                                 y_upper,\n",
    "                                                                 alpha,\n",
    "                                                                 \"CQR Random Forests\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As can be seen, we obtained valid coverage.\n",
    "\n",
    "### Asymmetric nonconformity score \n",
    "\n",
    "The nonconformity score function `QuantileRegErrFunc` treats the left and right tails symmetrically, but if the error distribution is significantly skewed, one may choose to treat them asymmetrically. This can be done by replacing `QuantileRegErrFunc` with `QuantileRegAsymmetricErrFunc`, as implemented in the following cell."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nonconformist.nc import QuantileRegAsymmetricErrFunc\n",
    "\n",
    "# define QRF model\n",
    "quantile_estimator = helper.QuantileForestRegressorAdapter(model=None,\n",
    "                                                           fit_params=None,\n",
    "                                                           quantiles=quantiles_forest,\n",
    "                                                           params=params_qforest)\n",
    "        \n",
    "# define the CQR object\n",
    "nc = RegressorNc(quantile_estimator, QuantileRegAsymmetricErrFunc())\n",
    "\n",
    "# run CQR procedure\n",
    "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n",
    "\n",
    "# compute and print average coverage and average length\n",
    "coverage_cp_qforest, length_cp_qforest = helper.compute_coverage(y_test,\n",
    "                                                                 y_lower,\n",
    "                                                                 y_upper,\n",
    "                                                                 alpha,\n",
    "                                                                 \"Asymmetric CQR Random Forests\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we also obtained valid coverage.\n",
    "\n",
    "\n",
    "## CQR neural net\n",
    "\n",
    "In what follows we will use neural network as the underlying quantile regression method. Below, we set the hyper-parameters of the CQR neural network method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#####################################################\n",
    "# Neural network parameters\n",
    "# (See AllQNet_RegressorAdapter class in helper.py)\n",
    "#####################################################\n",
    "from cqr import helper\n",
    "\n",
    "# pytorch's optimizer object\n",
    "nn_learn_func = torch.optim.Adam\n",
    "\n",
    "# number of epochs\n",
    "epochs = 1000\n",
    "\n",
    "# learning rate\n",
    "lr = 0.0005\n",
    "\n",
    "# mini-batch size\n",
    "batch_size = 64\n",
    "\n",
    "# hidden dimension of the network\n",
    "hidden_size = 64\n",
    "\n",
    "# dropout regularization rate\n",
    "dropout = 0.1\n",
    "\n",
    "# weight decay regularization\n",
    "wd = 1e-6\n",
    "\n",
    "# Ask for a reduced coverage when tuning the network parameters by \n",
    "# cross-validataion to avoid too concervative initial estimation of the \n",
    "# prediction interval. This estimation will be conformalized by CQR.\n",
    "quantiles_net = [0.1, 0.9]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now turn to invoke the CQR procedure. The class `AllQNet_RegressorAdapter` defines the underlying neural network estimator. Just as before, `RegressorNc` defines the CQR objecct, which uses `QuantileRegErrFunc` as the nonconformity score. The function `run_icp` returns the conformal band, computed on test data. Lastly, we compute the average coverage and length using `compute_coverage`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "helper.compute_coverage_len"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define quantile neural network model\n",
    "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n",
    "                                                     fit_params=None,\n",
    "                                                     in_shape=in_shape,\n",
    "                                                     hidden_size=hidden_size,\n",
    "                                                     quantiles=quantiles_net,\n",
    "                                                     learn_func=nn_learn_func,\n",
    "                                                     epochs=epochs,\n",
    "                                                     batch_size=batch_size,\n",
    "                                                     dropout=dropout,\n",
    "                                                     lr=lr,\n",
    "                                                     wd=wd,\n",
    "                                                     test_ratio=cv_test_ratio,\n",
    "                                                     random_state=cv_random_state,\n",
    "                                                     use_rearrangement=False)\n",
    "\n",
    "# define a CQR object, computes the absolute residual error of points \n",
    "# located outside the estimated quantile neural network band \n",
    "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n",
    "\n",
    "# run CQR procedure\n",
    "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n",
    "\n",
    "# compute and print average coverage and average length\n",
    "coverage_cp_qnet, length_cp_qnet = helper.compute_coverage(y_test,\n",
    "                                                           y_lower,\n",
    "                                                           y_upper,\n",
    "                                                           alpha,\n",
    "                                                           \"CQR Neural Net\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above, we can see that the prediction interval constructed by CQR Neural Net is also valid. Notice the difference in the average length between the two methods (CQR Neural Net and CQR Random Forests). \n",
    "\n",
    "## CQR neural net with rearrangement\n",
    "\n",
    "Crossing quantiles is a longstanding problem in quantile regression. This issue does not affect the validity guarantee of CQR as it holds regardless of the accuracy or choice of the quantile regression method. However, this may affect the effeciency of the resulting conformal band.\n",
    "\n",
    "Below we use the rearrangement method [3] to bypass the crossing quantile problem. Notice that we pass `use_rearrangement=True` as an argument to `AllQNet_RegressorAdapter`.\n",
    "\n",
    "[3] Chernozhukov Victor, Iván Fernández‐Val, and Alfred Galichon. “Quantile and probability curves without crossing.” Econometrica 78, no. 3 (2010): 1093-1125."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define quantile neural network model, using the rearrangement algorithm\n",
    "quantile_estimator = helper.AllQNet_RegressorAdapter(model=None,\n",
    "                                                     fit_params=None,\n",
    "                                                     in_shape=in_shape,\n",
    "                                                     hidden_size=hidden_size,\n",
    "                                                     quantiles=quantiles_net,\n",
    "                                                     learn_func=nn_learn_func,\n",
    "                                                     epochs=epochs,\n",
    "                                                     batch_size=batch_size,\n",
    "                                                     dropout=dropout,\n",
    "                                                     lr=lr,\n",
    "                                                     wd=wd,\n",
    "                                                     test_ratio=cv_test_ratio,\n",
    "                                                     random_state=cv_random_state,\n",
    "                                                     use_rearrangement=True)\n",
    "\n",
    "# define the CQR object, computing the absolute residual error of points \n",
    "# located outside the estimated quantile neural network band \n",
    "nc = RegressorNc(quantile_estimator, QuantileRegErrFunc())\n",
    "\n",
    "# run CQR procedure\n",
    "y_lower, y_upper = helper.run_icp(nc, x_train, y_train, x_test, idx_train, idx_cal, alpha)\n",
    "\n",
    "# compute and print average coverage and average length\n",
    "coverage_cp_re_qnet, length_cp_re_qnet = helper.compute_coverage(y_test,\n",
    "                                                                 y_lower,\n",
    "                                                                 y_upper,\n",
    "                                                                 alpha,\n",
    "                                                                 \"CQR Rearrangement Neural Net\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
