{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# COMPAS analysis\n",
    "\n",
    "We recreate the first section of the [Propublica COMPAS analysis](https://github.com/propublica/compas-analysis) in Python. \n",
    "\n",
    "What follows are the calculations performed for ProPublica's analaysis of the COMPAS Recidivism Risk Scores. It might be helpful to open [the methodology](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm/) in another tab to understand the following.\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the Data\n",
    "\n",
    "We select fields for severity of charge, number of priors, demographics, age, sex, compas scores, and whether each person was accused of a crime within two years.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Num rows: 7214\n"
     ]
    }
   ],
   "source": [
    "raw_data = pd.read_csv('./compas-scores-two-years.csv')\n",
    "print('Num rows: %d' %len(raw_data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However not all of the rows are useable for the first round of analysis.\n",
    "\n",
    "There are a number of reasons remove rows because of missing data:\n",
    "\n",
    " - If the charge date of a defendants Compas scored crime was not within 30 days from when the person was arrested, we assume that because of data quality reasons, that we do not have the right offense.\n",
    " - We coded the recidivist flag -- is_recid -- to be -1 if we could not find a compas case at all.\n",
    " - In a similar vein, ordinary traffic offenses -- those with a df of 'O' -- will not result in Jail time are removed (only two of them).\n",
    " - We filtered the underlying data from Broward county to include only those rows representing people who had either recidivated in two years, or had at least two years outside of a correctional facility.\n",
    " - We remove rows where there is no score_text ('N/A')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',\n",
       "       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',\n",
       "       'juv_misd_count', 'juv_other_count', 'priors_count',\n",
       "       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',\n",
       "       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',\n",
       "       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',\n",
       "       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',\n",
       "       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',\n",
       "       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',\n",
       "       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',\n",
       "       'decile_score.1', 'score_text', 'screening_date',\n",
       "       'v_type_of_assessment', 'v_decile_score', 'v_score_text',\n",
       "       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',\n",
       "       'start', 'end', 'event', 'two_year_recid'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "raw_data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Num rows filtered: 6172\n"
     ]
    }
   ],
   "source": [
    "df = raw_data[((raw_data['days_b_screening_arrest'] <=30) & \n",
    "      (raw_data['days_b_screening_arrest'] >= -30) &\n",
    "      (raw_data['is_recid'] != -1) &\n",
    "      (raw_data['c_charge_degree'] != 'O') & \n",
    "      (raw_data['score_text'] != 'N/A')\n",
    "     )]\n",
    "\n",
    "print('Num rows filtered: %d' % len(df))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>id</th>\n",
       "      <th>name</th>\n",
       "      <th>first</th>\n",
       "      <th>last</th>\n",
       "      <th>compas_screening_date</th>\n",
       "      <th>sex</th>\n",
       "      <th>dob</th>\n",
       "      <th>age</th>\n",
       "      <th>age_cat</th>\n",
       "      <th>race</th>\n",
       "      <th>...</th>\n",
       "      <th>v_decile_score</th>\n",
       "      <th>v_score_text</th>\n",
       "      <th>v_screening_date</th>\n",
       "      <th>in_custody</th>\n",
       "      <th>out_custody</th>\n",
       "      <th>priors_count.1</th>\n",
       "      <th>start</th>\n",
       "      <th>end</th>\n",
       "      <th>event</th>\n",
       "      <th>two_year_recid</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>miguel hernandez</td>\n",
       "      <td>miguel</td>\n",
       "      <td>hernandez</td>\n",
       "      <td>2013-08-14</td>\n",
       "      <td>Male</td>\n",
       "      <td>1947-04-18</td>\n",
       "      <td>69</td>\n",
       "      <td>Greater than 45</td>\n",
       "      <td>Other</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>Low</td>\n",
       "      <td>2013-08-14</td>\n",
       "      <td>2014-07-07</td>\n",
       "      <td>2014-07-14</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>327</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3</td>\n",
       "      <td>kevon dixon</td>\n",
       "      <td>kevon</td>\n",
       "      <td>dixon</td>\n",
       "      <td>2013-01-27</td>\n",
       "      <td>Male</td>\n",
       "      <td>1982-01-22</td>\n",
       "      <td>34</td>\n",
       "      <td>25 - 45</td>\n",
       "      <td>African-American</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>Low</td>\n",
       "      <td>2013-01-27</td>\n",
       "      <td>2013-01-26</td>\n",
       "      <td>2013-02-05</td>\n",
       "      <td>0</td>\n",
       "      <td>9</td>\n",
       "      <td>159</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>4</td>\n",
       "      <td>ed philo</td>\n",
       "      <td>ed</td>\n",
       "      <td>philo</td>\n",
       "      <td>2013-04-14</td>\n",
       "      <td>Male</td>\n",
       "      <td>1991-05-14</td>\n",
       "      <td>24</td>\n",
       "      <td>Less than 25</td>\n",
       "      <td>African-American</td>\n",
       "      <td>...</td>\n",
       "      <td>3</td>\n",
       "      <td>Low</td>\n",
       "      <td>2013-04-14</td>\n",
       "      <td>2013-06-16</td>\n",
       "      <td>2013-06-16</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>63</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>7</td>\n",
       "      <td>marsha miles</td>\n",
       "      <td>marsha</td>\n",
       "      <td>miles</td>\n",
       "      <td>2013-11-30</td>\n",
       "      <td>Male</td>\n",
       "      <td>1971-08-22</td>\n",
       "      <td>44</td>\n",
       "      <td>25 - 45</td>\n",
       "      <td>Other</td>\n",
       "      <td>...</td>\n",
       "      <td>1</td>\n",
       "      <td>Low</td>\n",
       "      <td>2013-11-30</td>\n",
       "      <td>2013-11-30</td>\n",
       "      <td>2013-12-01</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>853</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>8</td>\n",
       "      <td>edward riddle</td>\n",
       "      <td>edward</td>\n",
       "      <td>riddle</td>\n",
       "      <td>2014-02-19</td>\n",
       "      <td>Male</td>\n",
       "      <td>1974-07-23</td>\n",
       "      <td>41</td>\n",
       "      <td>25 - 45</td>\n",
       "      <td>Caucasian</td>\n",
       "      <td>...</td>\n",
       "      <td>2</td>\n",
       "      <td>Low</td>\n",
       "      <td>2014-02-19</td>\n",
       "      <td>2014-03-31</td>\n",
       "      <td>2014-04-18</td>\n",
       "      <td>14</td>\n",
       "      <td>5</td>\n",
       "      <td>40</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 53 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   id              name   first       last compas_screening_date   sex  \\\n",
       "0   1  miguel hernandez  miguel  hernandez            2013-08-14  Male   \n",
       "1   3       kevon dixon   kevon      dixon            2013-01-27  Male   \n",
       "2   4          ed philo      ed      philo            2013-04-14  Male   \n",
       "5   7      marsha miles  marsha      miles            2013-11-30  Male   \n",
       "6   8     edward riddle  edward     riddle            2014-02-19  Male   \n",
       "\n",
       "          dob  age          age_cat              race  ...  v_decile_score  \\\n",
       "0  1947-04-18   69  Greater than 45             Other  ...               1   \n",
       "1  1982-01-22   34          25 - 45  African-American  ...               1   \n",
       "2  1991-05-14   24     Less than 25  African-American  ...               3   \n",
       "5  1971-08-22   44          25 - 45             Other  ...               1   \n",
       "6  1974-07-23   41          25 - 45         Caucasian  ...               2   \n",
       "\n",
       "   v_score_text  v_screening_date  in_custody  out_custody  priors_count.1  \\\n",
       "0           Low        2013-08-14  2014-07-07   2014-07-14               0   \n",
       "1           Low        2013-01-27  2013-01-26   2013-02-05               0   \n",
       "2           Low        2013-04-14  2013-06-16   2013-06-16               4   \n",
       "5           Low        2013-11-30  2013-11-30   2013-12-01               0   \n",
       "6           Low        2014-02-19  2014-03-31   2014-04-18              14   \n",
       "\n",
       "  start  end event two_year_recid  \n",
       "0     0  327     0              0  \n",
       "1     9  159     1              1  \n",
       "2     0   63     0              1  \n",
       "5     1  853     0              0  \n",
       "6     5   40     1              1  \n",
       "\n",
       "[5 rows x 53 columns]"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
    "# https://www.rdocumentation.org/packages/fairml/versions/0.6.3/topics/compas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "7309"
      ]
     },
     "execution_count": 44,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(compas_balanced[compas_balanced[\"two_year_recid\"] == 0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',\n",
       "       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',\n",
       "       'juv_misd_count', 'juv_other_count', 'priors_count',\n",
       "       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',\n",
       "       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',\n",
       "       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',\n",
       "       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',\n",
       "       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',\n",
       "       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',\n",
       "       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',\n",
       "       'decile_score.1', 'score_text', 'screening_date',\n",
       "       'v_type_of_assessment', 'v_decile_score', 'v_score_text',\n",
       "       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',\n",
       "       'start', 'end', 'event', 'two_year_recid'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 45,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_filtered = df[[\"age\", \"juv_fel_count\", \"decile_score\", \"juv_misd_count\", \"juv_other_count\", \"v_decile_score\", \"priors_count\", \"two_year_recid\"]]\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/satyapriyakrishna/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: \n",
      "A value is trying to be set on a copy of a slice from a DataFrame.\n",
      "Try using .loc[row_indexer,col_indexer] = value instead\n",
      "\n",
      "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n",
      "  \n"
     ]
    }
   ],
   "source": [
    "# df_filtered[\"two_year_recid\"].unique()\n",
    "df_filtered['two_year_recid'] = df_filtered['two_year_recid'].replace([0,1], [1,0])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "554\n"
     ]
    }
   ],
   "source": [
    "print(len(df_filtered[df_filtered[\"two_year_recid\"] == 1]) - len(df_filtered[df_filtered[\"two_year_recid\"] == 0]))\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.utils import resample\n",
    "# import numpy as np\n",
    "# y = np.array([0,1,2,4])\n",
    "# print(resample(y, n_samples = 10, random_state = 3))\n",
    "compas_upsampled = resample(df_filtered[df_filtered[\"two_year_recid\"] == 0], n_samples = 2500, random_state = 1)\n",
    "compas_balanced = pd.concat([df_filtered, compas_upsampled], axis=0)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 8672 entries, 0 to 5573\n",
      "Data columns (total 8 columns):\n",
      " #   Column           Non-Null Count  Dtype\n",
      "---  ------           --------------  -----\n",
      " 0   age              8672 non-null   int64\n",
      " 1   juv_fel_count    8672 non-null   int64\n",
      " 2   decile_score     8672 non-null   int64\n",
      " 3   juv_misd_count   8672 non-null   int64\n",
      " 4   juv_other_count  8672 non-null   int64\n",
      " 5   v_decile_score   8672 non-null   int64\n",
      " 6   priors_count     8672 non-null   int64\n",
      " 7   two_year_recid   8672 non-null   int64\n",
      "dtypes: int64(8)\n",
      "memory usage: 609.8 KB\n"
     ]
    }
   ],
   "source": [
    "compas_balanced.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
    "# df_filtered.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
    "#store without upsampling\n",
    "from sklearn.model_selection import train_test_split\n",
    "x, y = compas_balanced.drop('two_year_recid', axis=1), compas_balanced['two_year_recid']\n",
    "x_train, x_test, y_train, y_test= train_test_split(x,y, test_size=.3, random_state=55)\n",
    "x_train_df = pd.concat([x_train, y_train], axis=1)\n",
    "x_test_df = pd.concat([x_test, y_test], axis=1)\n",
    "x_train_df.to_csv(\"compas-train-processed2.csv\", index = False)\n",
    "x_test_df.to_csv(\"compas-test-processed2.csv\", index = False)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Higher COMPAS scores are slightly correlated with a longer length of stay.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from datetime import datetime\n",
    "from scipy.stats import pearsonr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def date_from_str(s):\n",
    "    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find the Pearson correlation between length of stay (jail_out - jail_in) and COMPAS decile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "After filtering we have the following demographic breakdown:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find counts for each age group\n",
    "# TODO: find counts and percentages for each race group\n",
    "# TODO: find counts of each score_text group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: create interaction table of counts between race/sex interactions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: find counts and percentages for each gender group"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: How many defendants had two-year recidivism in last two years? What percentage of all defendants?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Judges are often presented with two sets of scores from the Compas system -- one that classifies people into High, Medium and Low risk, and a corresponding decile score. There is a clear downward trend in the decile scores as those scores increase for white defendants.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "from matplotlib import pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: plot decile scores by count for African-America defendants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: plot decile scores by count for Caucasian defendants"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Create a pivot table between decile_score and race"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Racial Bias in Compas\n",
    "\n",
    "After filtering out bad rows, our first question is whether there is a significant difference in Compas scores between races. To do so we need to change some variables into factors, and run a logistic regression, comparing low scores to high scores.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = LogisticRegression(solver='lbfgs')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: create columns for all options for charge_degree, age, race, sex, and "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Run these input features to predict whether score_text is Low or Med/High using a Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: What is the correlation coefficient for African_American? What is the odds ratio?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
