{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook is intended as a companion to Section 3 - Preprocessing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate the various transformations discussed, we'll populate a simple toy dataset for illustrative purposes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number</th>\n",
       "      <th>category</th>\n",
       "      <th>binary</th>\n",
       "      <th>datetime</th>\n",
       "      <th>address</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.00</td>\n",
       "      <td>circle</td>\n",
       "      <td>yes</td>\n",
       "      <td>1/01/2017 1:00am</td>\n",
       "      <td>1234 North Peterson St Orlando, FL 32714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>33.00</td>\n",
       "      <td>circle</td>\n",
       "      <td>yes</td>\n",
       "      <td>3/10/2018 6:30am</td>\n",
       "      <td>2345 South Anderson St Altamonte Springs, FL 3...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>134.00</td>\n",
       "      <td>circle</td>\n",
       "      <td>yes</td>\n",
       "      <td>5/31/2018 10:15am</td>\n",
       "      <td>3456 South Peterson St Maitland, FL 32789</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>333.00</td>\n",
       "      <td>square</td>\n",
       "      <td>yes</td>\n",
       "      <td>7/26/2019 1:40pm</td>\n",
       "      <td>4567 North Peterson St Orlando, FL 32714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42.00</td>\n",
       "      <td>square</td>\n",
       "      <td>no</td>\n",
       "      <td>9/05/2019 6:45pm</td>\n",
       "      <td>5678 Avenue St Orlando, FL 32714</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.33</td>\n",
       "      <td>triangle</td>\n",
       "      <td>no</td>\n",
       "      <td>12/25/2019</td>\n",
       "      <td>6789 South Peterson St Maitland, FL 32789</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>15.50</td>\n",
       "      <td>1234</td>\n",
       "      <td>no</td>\n",
       "      <td>11:30pm</td>\n",
       "      <td>5858 North Other St Altamonte Springs, FL 32715</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-1.00</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>NaN</td>\n",
       "      <td>square</td>\n",
       "      <td>yes</td>\n",
       "      <td>string</td>\n",
       "      <td>Orlando, FL</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number  category binary           datetime  \\\n",
       "0    0.00    circle    yes   1/01/2017 1:00am   \n",
       "1   33.00    circle    yes   3/10/2018 6:30am   \n",
       "2  134.00    circle    yes  5/31/2018 10:15am   \n",
       "3  333.00    square    yes   7/26/2019 1:40pm   \n",
       "4   42.00    square     no   9/05/2019 6:45pm   \n",
       "5    0.33  triangle     no         12/25/2019   \n",
       "6   15.50      1234     no            11:30pm   \n",
       "7   -1.00       NaN    NaN                NaN   \n",
       "8     NaN    square    yes             string   \n",
       "\n",
       "                                             address  \n",
       "0           1234 North Peterson St Orlando, FL 32714  \n",
       "1  2345 South Anderson St Altamonte Springs, FL 3...  \n",
       "2          3456 South Peterson St Maitland, FL 32789  \n",
       "3           4567 North Peterson St Orlando, FL 32714  \n",
       "4                   5678 Avenue St Orlando, FL 32714  \n",
       "5          6789 South Peterson St Maitland, FL 32789  \n",
       "6    5858 North Other St Altamonte Springs, FL 32715  \n",
       "7                                               None  \n",
       "8                                        Orlando, FL  "
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "df_train = pd.DataFrame({'number':[0, 33, 134, 333, 42, 0.33, 15.5, -1, np.nan],\n",
    "                         'category':['circle','circle','circle','square','square','triangle', 1234, np.nan, 'square'],\n",
    "                         'binary':['yes','yes','yes','yes','no','no','no',np.nan,'yes'],\n",
    "                         'datetime':['1/01/2017 1:00am', '3/10/2018 6:30am', '5/31/2018 10:15am', '7/26/2019 1:40pm', '9/05/2019 6:45pm', '12/25/2019', '11:30pm', np.nan, 'string'],\n",
    "                         'address':['1234 North Peterson St Orlando, FL 32714',\n",
    "                                   '2345 South Anderson St Altamonte Springs, FL 32715',\n",
    "                                   '3456 South Peterson St Maitland, FL 32789',\n",
    "                                   '4567 North Peterson St Orlando, FL 32714',\n",
    "                                   '5678 Avenue St Orlando, FL 32714',\n",
    "                                   '6789 South Peterson St Maitland, FL 32789',\n",
    "                                   '5858 North Other St Altamonte Springs, FL 32715',\n",
    "                                   None,\n",
    "                                   'Orlando, FL'],\n",
    "                        })\n",
    "\n",
    "df_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### under automation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Let's run a quick automated application\n",
    "#we'll turn off shuffling for ease of inspection\n",
    "#and turn off printouts to avoid clutter\n",
    "\n",
    "from Automunge import *\n",
    "am = AutoMunge()\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Upon inspecting the returned train set we see by suffix appenders which transforms were applied as part of a returned column's derivation. In each case, the returned data is supplemented with a 'NArw' column signalling presence of infill.\n",
    "\n",
    "The numeric feature \"number\" had a 'nmbr' transform applied which is a z-score normalization. \n",
    "\n",
    "The categoric feature \"binary\" had a 'bnry' transform applied, which is a single-column binarization.\n",
    "\n",
    "The categoric feature \"category\" also had a binarization by the '1010' transform returning three columns. \n",
    "\n",
    "The datetime feature \"datetime\" had entries segregated by time scale (year, month/day, hour/minute/second) with separate sine and cosine transforms to accommodate periodicity, and also had bins aggregated for business hours, weekdays, and holidays. Note that the intermediate transform 'tmzn', visible in the suffix appenders, is there to allow designation of a desired timezone by parameter.\n",
    "\n",
    "The final categoric set, \"address\", was subjected to hashing (aka the \"hashing trick\"). Because every entry in this set was unique, it was given a parsed hashing in which distinct words found within entries, based on a space separator, were individually encoded. (Eight 'hash' columns are returned because the maximum word count in an entry was eight words.)\n",
    "\n",
    "Not shown here: hashing is also applied under automation when the all-unique threshold hasn't been reached but the number of unique entries in a categoric set is still above the configurable \"numbercategoryheuristic\" threshold, which defaults to 255. In this case, hashing is applied to aggregate unique entries without parsing for distinct words."
   ]
  },
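  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the parsed hashing described above, here is a minimal sketch, not Automunge's actual implementation. The helper name `parsed_hash`, the use of md5, the `vocab_size` of 50, and the all-zeros treatment of missing entries are all assumptions made for illustration, so the integers it produces will not match the 'hash' columns returned by Automunge.\n",
    "\n",
    "```python\n",
    "import hashlib\n",
    "\n",
    "def parsed_hash(entry, n_columns, vocab_size=50):\n",
    "    # hypothetical helper: split an entry on spaces and hash each word\n",
    "    # to an integer in [1, vocab_size], reserving 0 for padding\n",
    "    words = entry.split(' ') if isinstance(entry, str) else []\n",
    "    codes = [int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) % vocab_size + 1\n",
    "             for w in words]\n",
    "    # pad (or truncate) so every entry returns the same number of columns\n",
    "    return (codes + [0] * n_columns)[:n_columns]\n",
    "\n",
    "addresses = ['1234 North Peterson St Orlando, FL 32714', 'Orlando, FL', None]\n",
    "\n",
    "# the column count is set by the longest entry, measured in words\n",
    "n_columns = max(len(a.split(' ')) for a in addresses if isinstance(a, str))\n",
    "\n",
    "for a in addresses:\n",
    "    print(parsed_hash(a, n_columns))\n",
    "```\n",
    "\n",
    "Because identical words hash to identical integers, a word that appears in several entries (such as 'Orlando,') receives the same integer wherever it appears, even though the full entries are unique."
   ]
  },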
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_nmbr</th>\n",
       "      <th>binary_bnry</th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_1010_0</th>\n",
       "      <th>category_1010_1</th>\n",
       "      <th>category_1010_2</th>\n",
       "      <th>binary_NArw</th>\n",
       "      <th>datetime_NArw</th>\n",
       "      <th>datetime_tmzn_year</th>\n",
       "      <th>datetime_tmzn_mdsn</th>\n",
       "      <th>datetime_tmzn_mdcs</th>\n",
       "      <th>datetime_tmzn_hmss</th>\n",
       "      <th>datetime_tmzn_hmsc</th>\n",
       "      <th>datetime_tmzn_bshr</th>\n",
       "      <th>datetime_tmzn_wkdy</th>\n",
       "      <th>datetime_tmzn_hldy</th>\n",
       "      <th>address_NArw</th>\n",
       "      <th>address_hash_0</th>\n",
       "      <th>address_hash_1</th>\n",
       "      <th>address_hash_2</th>\n",
       "      <th>address_hash_3</th>\n",
       "      <th>address_hash_4</th>\n",
       "      <th>address_hash_5</th>\n",
       "      <th>address_hash_6</th>\n",
       "      <th>address_hash_7</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.644929</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.500117</td>\n",
       "      <td>5.145554e-01</td>\n",
       "      <td>0.857457</td>\n",
       "      <td>0.258819</td>\n",
       "      <td>0.965926</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.339160</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>9.857698e-01</td>\n",
       "      <td>-0.168101</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>20</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.596678</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>1.224647e-16</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.442289</td>\n",
       "      <td>-0.896873</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.440556</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-8.207635e-01</td>\n",
       "      <td>-0.571268</td>\n",
       "      <td>-0.422618</td>\n",
       "      <td>-0.906308</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.255769</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-9.961947e-01</td>\n",
       "      <td>0.087156</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20</td>\n",
       "      <td>37</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-0.641871</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>4.098203e-01</td>\n",
       "      <td>0.912166</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>29</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.501310</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>38</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-0.654195</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.649551</td>\n",
       "      <td>-1.783955e-02</td>\n",
       "      <td>0.459045</td>\n",
       "      <td>-0.011968</td>\n",
       "      <td>0.440456</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>-0.360567</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.312024</td>\n",
       "      <td>-3.757586e-01</td>\n",
       "      <td>-0.183430</td>\n",
       "      <td>-0.190840</td>\n",
       "      <td>-0.094732</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_nmbr  binary_bnry  number_NArw  category_NArw  category_1010_0  \\\n",
       "0    -0.644929            1            0              0                0   \n",
       "1    -0.339160            1            0              0                0   \n",
       "2     0.596678            1            0              0                0   \n",
       "3     2.440556            1            0              0                0   \n",
       "4    -0.255769            0            0              0                0   \n",
       "5    -0.641871            0            0              0                0   \n",
       "6    -0.501310            0            0              0                0   \n",
       "7    -0.654195            1            0              1                0   \n",
       "8    -0.360567            1            1              0                0   \n",
       "\n",
       "   category_1010_1  category_1010_2  binary_NArw  datetime_NArw  \\\n",
       "0                0                1            0              0   \n",
       "1                0                1            0              0   \n",
       "2                0                1            0              0   \n",
       "3                1                0            0              0   \n",
       "4                1                0            0              0   \n",
       "5                1                1            0              0   \n",
       "6                0                0            0              0   \n",
       "7                0                0            1              1   \n",
       "8                1                0            0              1   \n",
       "\n",
       "   datetime_tmzn_year  datetime_tmzn_mdsn  datetime_tmzn_mdcs  \\\n",
       "0           -1.500117        5.145554e-01            0.857457   \n",
       "1           -0.825064        9.857698e-01           -0.168101   \n",
       "2           -0.825064        1.224647e-16           -1.000000   \n",
       "3           -0.150012       -8.207635e-01           -0.571268   \n",
       "4           -0.150012       -9.961947e-01            0.087156   \n",
       "5           -0.150012        4.098203e-01            0.912166   \n",
       "6            1.200094       -3.489950e-02           -0.999391   \n",
       "7           -0.649551       -1.783955e-02            0.459045   \n",
       "8           -0.312024       -3.757586e-01           -0.183430   \n",
       "\n",
       "   datetime_tmzn_hmss  datetime_tmzn_hmsc  datetime_tmzn_bshr  \\\n",
       "0            0.258819            0.965926                   0   \n",
       "1            0.991445           -0.130526                   0   \n",
       "2            0.442289           -0.896873                   1   \n",
       "3           -0.422618           -0.906308                   1   \n",
       "4           -0.980785            0.195090                   0   \n",
       "5            0.000000            1.000000                   0   \n",
       "6           -0.130526            0.991445                   0   \n",
       "7           -0.011968            0.440456                   0   \n",
       "8           -0.190840           -0.094732                   0   \n",
       "\n",
       "   datetime_tmzn_wkdy  datetime_tmzn_hldy  address_NArw  address_hash_0  \\\n",
       "0                   0                   0             0               5   \n",
       "1                   0                   0             0              19   \n",
       "2                   1                   0             0              24   \n",
       "3                   1                   0             0              32   \n",
       "4                   1                   0             0              20   \n",
       "5                   1                   1             0              29   \n",
       "6                   1                   0             0              16   \n",
       "7                   1                   0             1              24   \n",
       "8                   1                   0             0              24   \n",
       "\n",
       "   address_hash_1  address_hash_2  address_hash_3  address_hash_4  \\\n",
       "0               4              23              19              24   \n",
       "1              20               4              19              30   \n",
       "2              20              23              19              43   \n",
       "3               4              23              19              24   \n",
       "4              37              19              24              23   \n",
       "5              20              23              19              43   \n",
       "6               4              38              19              30   \n",
       "7               0               0               0               0   \n",
       "8              23               0               0               0   \n",
       "\n",
       "   address_hash_5  address_hash_6  address_hash_7  \n",
       "0              23              16               0  \n",
       "1              31              23              41  \n",
       "2              23              12               0  \n",
       "3              23              16               0  \n",
       "4              16               0               0  \n",
       "5              23              12               0  \n",
       "6              31              23              41  \n",
       "7               0               0               0  \n",
       "8               0               0               0  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.set_option('display.max_columns', 100)\n",
    "\n",
    "train"
   ]
  },
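  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The periodic encoding noted above for the \"datetime\" feature can be sketched as follows. This is a minimal illustration assuming a 24-hour period measured in seconds, not Automunge's internal code; the helper name `time_of_day_sin_cos` and the parsing format string are hypothetical. For an entry at 1:00am, the result should be close to the 'hmss' and 'hmsc' values displayed above (about 0.2588 and 0.9659).\n",
    "\n",
    "```python\n",
    "import math\n",
    "from datetime import datetime\n",
    "\n",
    "def time_of_day_sin_cos(text, fmt='%m/%d/%Y %I:%M%p'):\n",
    "    # hypothetical helper: place the time of day on a unit circle\n",
    "    # so that 11:59pm and 12:00am end up as neighbors\n",
    "    t = datetime.strptime(text, fmt)\n",
    "    seconds = t.hour * 3600 + t.minute * 60 + t.second\n",
    "    angle = 2 * math.pi * seconds / 86400  # 86400 seconds per day\n",
    "    return math.sin(angle), math.cos(angle)\n",
    "\n",
    "for entry in ['1/01/2017 1:00am', '3/10/2018 6:30am', '5/31/2018 10:15am']:\n",
    "    print(entry, time_of_day_sin_cos(entry))\n",
    "```\n",
    "\n",
    "Using both sine and cosine gives each time of day a unique pair of values, which a single periodic function alone would not."
   ]
  },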
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that in some cases the set of columns returned from an input feature may not all be adjacent in the returned dataframe. This was an intentional design decision. If grouping coherence is desired, a variation can be applied with the assignparam parameter, which will activate grouping coherence of returned sets.\n",
    "\n",
    "```\n",
    "assignparam = {'global_assignparam' : {'inplace' : False}}\n",
    "```\n",
    "Otherwise the set of columns returned from an input feature can be accessed from the returned postprocess_dict by:\n",
    "```\n",
    "postprocess_dict['column_map']['<input_feature_header>']\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>binary_bnry</th>\n",
       "      <th>binary_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   binary_bnry  binary_NArw\n",
       "0            1            0\n",
       "1            1            0\n",
       "2            1            0\n",
       "3            1            0\n",
       "4            0            0\n",
       "5            0            0\n",
       "6            0            0\n",
       "7            1            1\n",
       "8            1            0"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train[postprocess_dict['column_map']['binary']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For completeness, we'll quickly demonstrate the defaults under automation for label sets:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Numeric labels are left unnormalized by the 'exc2' transform, which is basically a pass-through transform for numeric sets. Note that label sets are not supplemented with the NArw column. Since this returns a single-column label set, it is received as a pandas Series instead of a DataFrame."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      0.00\n",
       "1     33.00\n",
       "2    134.00\n",
       "3    333.00\n",
       "4     42.00\n",
       "5      0.33\n",
       "6     15.50\n",
       "7     -1.00\n",
       "8     -1.00\n",
       "Name: number_exc2, dtype: float32"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#numeric label set\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             labels_column = 'number',\n",
    "             shuffletrain = False,\n",
    "             printstatus = False)\n",
    "\n",
    "labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Categoric labels are instead given an ordinal encoding by the 'ordl' transform."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1\n",
       "1    1\n",
       "2    1\n",
       "3    2\n",
       "4    2\n",
       "5    3\n",
       "6    0\n",
       "7    4\n",
       "8    2\n",
       "Name: category_ordl, dtype: uint8"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#categoric label set\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             labels_column = 'category',\n",
    "             shuffletrain = False,\n",
    "             printstatus = False)\n",
    "\n",
    "labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### assigning transforms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The paper notes several types of custom transforms available for assignment; here we'll demonstrate a few."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Numeric features may be assigned to any range of transformations, normalizations, and bin aggregations\"\n",
    "\n",
    "=> there is a lot to choose from. Here we'll demonstrate a transformation set (assembled with our \"family tree primitives\") which includes an upstream log transform 'log0' followed by a downstream min/max scaling 'mnmx', as well as equal population bin aggregations 'bnep', which we'll assign to our numeric feature.\n",
    "\n",
    "We'll pass a parameter to the bin aggregations to designate the bin count."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>number_log0_mnmx</th>\n",
       "      <th>number_bnep_0</th>\n",
       "      <th>number_bnep_1</th>\n",
       "      <th>number_bnep_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.665794</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.868393</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.700661</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0.556543</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_NArw  number_log0_mnmx  number_bnep_0  number_bnep_1  number_bnep_2\n",
       "0            1          0.631899              1              0              0\n",
       "1            0          0.665794              0              1              0\n",
       "2            0          0.868393              0              0              1\n",
       "3            0          1.000000              0              0              1\n",
       "4            0          0.700661              0              0              1\n",
       "5            0          0.000000              1              0              0\n",
       "6            0          0.556543              0              1              0\n",
       "7            1          0.631899              1              0              0\n",
       "8            1          0.631899              0              0              0"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#family trees are defined in the transformdict\n",
    "#here the upstream transforms are newt, bnep, and NArw\n",
    "#and since parents is a primitive with offspring\n",
    "#the downstream primitives are inspected \n",
    "#and mnmx is performed downstream of newt\n",
    "transformdict =  {'newt' : {'parents' : ['newt'], \\\n",
    "                            'siblings': [], \\\n",
    "                            'auntsuncles' : ['bnep'], \\\n",
    "                            'cousins' : ['NArw'], \\\n",
    "                            'children' : [], \\\n",
    "                            'niecesnephews' : [], \\\n",
    "                            'coworkers' : ['mnmx'], \\\n",
    "                            'friends' : []}}\n",
    "\n",
    "#since we're defining a new root category 'newt' in a transformdict\n",
    "#we'll need a corresponding processdict entry\n",
    "#we'll designate that the 'newt' category\n",
    "#is associated with a log0 transform\n",
    "#as applied above in the parents primitive\n",
    "processdict = {'newt' : {'functionpointer' : 'log0'}}\n",
    "\n",
    "#we'll assign this root category to column 'number'\n",
    "assigncat = {'newt' : ['number']}\n",
    "\n",
    "#since bnep returns 5 bins and our toy data set only has 9 entries\n",
    "#let's go ahead and set a parameter to return fewer bins\n",
    "#bnep accepts parameter 'bincount' as documented in the README\n",
    "assignparam = {'default_assignparam' : {'bnep' : {'bincount' : 3}}}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             transformdict = transformdict,\n",
    "             processdict = processdict, \n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False,\n",
    "             assignparam = assignparam)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['number']]\n",
    "\n",
    "#note that since the root category had a processdict entry derived \n",
    "#from a functionpointer to log0\n",
    "#the associated NArowtype is applied which treats entries <=0 as subject to infill"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      0.00\n",
       "1     33.00\n",
       "2    134.00\n",
       "3    333.00\n",
       "4     42.00\n",
       "5      0.33\n",
       "6     15.50\n",
       "7     -1.00\n",
       "8       NaN\n",
       "Name: number, dtype: float64"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Here again is the input column that was the source for comparison\n",
    "\n",
    "df_train['number']"
   ]
  },
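  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough check on the log0 to mnmx chain, the values returned in number_log0_mnmx are consistent with min/max scaling applied to the logarithm of the valid (positive) entries. This is a sketch inferred from the output shown above rather than a statement of the library's implementation, and the symbols $y$, $x_{\\min}$, and $x_{\\max}$ (here 0.33 and 333) are notation introduced for illustration.\n",
    "\n",
    "$$y = \\frac{\\log x - \\log x_{\\min}}{\\log x_{\\max} - \\log x_{\\min}}$$\n",
    "\n",
    "For the entry 33 this gives\n",
    "\n",
    "$$y = \\frac{\\log(33 / 0.33)}{\\log(333 / 0.33)} = \\frac{\\log 100}{\\log 1009.09} \\approx 0.6658$$\n",
    "\n",
    "which matches row 1 (the base of the logarithm cancels in the ratio). The entries 0, -1, and NaN fall outside the domain of the log transform, so those rows are marked by number_NArw and receive infill. The infill value 0.631899 equals the mean of the six valid scaled entries."
   ]
  },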
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Sequential numeric features may be supplemented by proxies for derivatives\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several pre-defined transformation sets that may serve to supplement sequential numeric features with proxies for derivatives (where by sequential we mean data like time series).\n",
    "\n",
    "The dxdt family of transforms (dxdt, d2dt, d3dt, etc) approximates a derivative by taking deltas between adjacent time steps followed by a normalization. If a larger interval is desired for the delta, a user can designate it by passing the 'periods' parameter through assignparam.\n",
    "\n",
    "The dxdt root category returns the original data normalized by the retn transform, supplemented by a dxdt proxy for the first order derivative (like velocity) that is also normalized by retn. That is, when dxdt is assigned to an input column with header 'column', it returns the set {'column_retn', 'column_dxdt_retn', 'column_NArw'}. Similarly, the d2dt root category returns an additional column with a proxy for the second order derivative (like acceleration), as the set {'column_retn', 'column_dxdt_retn', 'column_dxdt_dxdt_retn', 'column_NArw'}.\n",
    "\n",
    "(As an aside, the retn transform is a type of normalization with sign retention; more detail is in the paper Numeric Encoding Options with Automunge.)\n",
    "\n",
    "Another variant is available with the dxd2 family (dxd2, d2d2, d3d2, etc), which produces a denoised signal by taking the difference between averages over multiple time steps. (In base configuration this is the average of the last two rows minus the average of the preceding two rows, with increased intervals again available by the 'periods' parameter.)"
   ]
  },
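  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The deltas described above can be written out as follows. This is a sketch of our reading of the description, not the library's implementation. The symbols $x_t$ (the entry at row $t$) and $p$ (the interval set by 'periods') are notation introduced here for illustration, and in each case the retn normalization is applied afterwards.\n",
    "\n",
    "For the dxdt family, with $p = 1$ for adjacent rows:\n",
    "\n",
    "$$\\Delta x_t = x_t - x_{t-p}$$\n",
    "\n",
    "For the dxd2 family in base configuration (average of the last two rows minus average of the preceding two rows):\n",
    "\n",
    "$$\\Delta_2 x_t = \\frac{x_t + x_{t-1}}{2} - \\frac{x_{t-2} + x_{t-3}}{2}$$\n",
    "\n",
    "Higher orders such as d2dt apply the same delta again to the result of the first, which is reflected in the returned header column_dxdt_dxdt_retn."
   ]
  },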
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_retn</th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>number_dxdt_retn</th>\n",
       "      <th>number_dxdt_dxdt_retn</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0.067347</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.098802</td>\n",
       "      <td>0</td>\n",
       "      <td>0.067347</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.401198</td>\n",
       "      <td>0</td>\n",
       "      <td>0.206122</td>\n",
       "      <td>0.091975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.997006</td>\n",
       "      <td>0</td>\n",
       "      <td>0.406122</td>\n",
       "      <td>0.132552</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.125749</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.593878</td>\n",
       "      <td>-0.662762</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.000988</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.085041</td>\n",
       "      <td>0.337238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0.046407</td>\n",
       "      <td>0</td>\n",
       "      <td>0.030959</td>\n",
       "      <td>0.076880</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-0.002994</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.033673</td>\n",
       "      <td>-0.042836</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.208394</td>\n",
       "      <td>1</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.022318</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_retn  number_NArw  number_dxdt_retn  number_dxdt_dxdt_retn\n",
       "0     0.000000            0          0.067347               0.000000\n",
       "1     0.098802            0          0.067347               0.000000\n",
       "2     0.401198            0          0.206122               0.091975\n",
       "3     0.997006            0          0.406122               0.132552\n",
       "4     0.125749            0         -0.593878              -0.662762\n",
       "5     0.000988            0         -0.085041               0.337238\n",
       "6     0.046407            0          0.030959               0.076880\n",
       "7    -0.002994            0         -0.033673              -0.042836\n",
       "8     0.208394            1          0.000000               0.022318"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#to demonstrate let's assign the d2dt root category to column 'number'\n",
    "assigncat = {'d2dt' : ['number']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['number']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      0.00\n",
       "1     33.00\n",
       "2    134.00\n",
       "3    333.00\n",
       "4     42.00\n",
       "5      0.33\n",
       "6     15.50\n",
       "7     -1.00\n",
       "8       NaN\n",
       "Name: number, dtype: float64"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Here again is the input column that was the source for comparison\n",
    "\n",
    "df_train['number']"
   ]
  },
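  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a worked check of the returned values against the input column, consider row 2. This assumes that retn, for a set of values spanning zero, divides each value by the span between the largest and smallest value. That assumption is inferred from the numbers displayed above and is not a full statement of the transform (see the paper Numeric Encoding Options with Automunge for the definition).\n",
    "\n",
    "The input ranges from -1 to 333, a span of 334, so\n",
    "\n",
    "$$\\text{retn}(134) = \\frac{134}{334} \\approx 0.401198$$\n",
    "\n",
    "The delta at row 2 is $134 - 33 = 101$. The deltas range from $-291$ (row 4) to $199$ (row 3), a span of 490, so\n",
    "\n",
    "$$\\frac{101}{490} \\approx 0.206122$$\n",
    "\n",
    "Both results match the number_retn and number_dxdt_retn entries returned for row 2."
   ]
  },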
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Categoric features may be subject to encodings like ordinal, one-hot, binarization, hashing,\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's go ahead and demonstrate each of these side by side for comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category_hash</th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_ord3</th>\n",
       "      <th>category_1234</th>\n",
       "      <th>category_circle</th>\n",
       "      <th>category_square</th>\n",
       "      <th>category_triangle</th>\n",
       "      <th>category_1010_0</th>\n",
       "      <th>category_1010_1</th>\n",
       "      <th>category_1010_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>9</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>6</td>\n",
       "      <td>1</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   category_hash  category_NArw  category_ord3  category_1234  \\\n",
       "0              5              0              0              0   \n",
       "1              5              0              0              0   \n",
       "2              5              0              0              0   \n",
       "3              6              0              1              0   \n",
       "4              6              0              1              0   \n",
       "5              9              0              3              0   \n",
       "6              5              0              2              1   \n",
       "7              6              1              4              0   \n",
       "8              6              0              1              0   \n",
       "\n",
       "   category_circle  category_square  category_triangle  category_1010_0  \\\n",
       "0                1                0                  0                0   \n",
       "1                1                0                  0                0   \n",
       "2                1                0                  0                0   \n",
       "3                0                1                  0                0   \n",
       "4                0                1                  0                0   \n",
       "5                0                0                  1                0   \n",
       "6                0                0                  0                0   \n",
       "7                0                0                  0                1   \n",
       "8                0                1                  0                0   \n",
       "\n",
       "   category_1010_1  category_1010_2  \n",
       "0                0                1  \n",
       "1                0                1  \n",
       "2                0                1  \n",
       "3                1                0  \n",
       "4                1                0  \n",
       "5                1                1  \n",
       "6                0                0  \n",
       "7                0                0  \n",
       "8                1                0  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#family trees are defined in the transformdict\n",
    "#here we'll apply ordinal via 'ord3', one-hot via 'text', \n",
    "#binarization via '1010', and hashing via 'hash'\n",
    "\n",
    "\n",
    "transformdict =  {'newt' : {'parents' : [], \\\n",
    "                            'siblings': [], \\\n",
    "                            'auntsuncles' : ['ord3', 'text', '1010', 'hash'], \\\n",
    "                            'cousins' : ['NArw'], \\\n",
    "                            'children' : [], \\\n",
    "                            'niecesnephews' : [], \\\n",
    "                            'coworkers' : [], \\\n",
    "                            'friends' : []}}\n",
    "\n",
    "#since we're defining a new root category 'newt' in a transformdict\n",
    "#we'll need a corresponding processdict entry\n",
    "#we'll just point it to ord3, which is an arbitrary choice\n",
    "processdict = {'newt' : {'functionpointer' : 'ord3'}}\n",
    "\n",
    "#we'll assign this root category to column 'category'\n",
    "assigncat = {'newt' : ['category']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             transformdict = transformdict,\n",
    "             processdict = processdict, \n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['category']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is a summary of each transformation category and associated returned columns:\n",
    "- hash: category_hash\n",
    "- NArw: category_NArw\n",
    "- ord3: category_ord3\n",
    "- text: category_1234, category_circle, category_square, category_triangle\n",
    "- 1010: category_1010_0, category_1010_1, category_1010_2\n",
    "\n",
    "Note that the 'text' one-hot encoding is somewhat unusual in that its suffix appenders are taken from the entry associated with each activation (e.g. category_circle). Another variant of one-hot encoding, with privacy preserving suffix appenders, is available as category 'onht'."
   ]
  },
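  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The differing column counts follow from how many distinct encodings each representation needs. The relation below is a general property of binary encodings, offered as a sketch rather than a description of how the library sizes its output. The symbol $n$ (the number of distinct encodings to represent) is notation introduced here.\n",
    "\n",
    "$$\\text{binarization columns} = \\lceil \\log_2 n \\rceil$$\n",
    "\n",
    "In the output above the 1010 encoding distinguishes five cases (circle, square, triangle, 1234, and the missing entry), so $\\lceil \\log_2 5 \\rceil = 3$ columns suffice. The one-hot 'text' encoding returns one column per category, four here, with the missing entry represented as all zeros."
   ]
  },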
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0      circle\n",
       "1      circle\n",
       "2      circle\n",
       "3      square\n",
       "4      square\n",
       "5    triangle\n",
       "6        1234\n",
       "7         NaN\n",
       "8      square\n",
       "Name: category, dtype: object"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Here again is the input column that was the source for comparison\n",
    "\n",
    "df_train['category']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"or even parsed categoric encoding [11] with an increased information retention in comparison to one-hot encoding by a vectorization as a function of grammatical structure shared between entries.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several variants of parsed categoric encodings in the library. We go over them in some detail in the paper Parsed Categoric Encodings with Automunge, particularly the or19 root category. Here we'll demonstrate another variant available in the library as or23, which is similar to or19 but replaces the spl9/sp10 chain with sp19. The sp19 transform is similar to splt with concurrent activations, but with the returned activations aggregated into a binarization. (Forgive the esoteric vocabulary; for more detail see the paper Parsed Categoric Encodings with Automunge.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>address_UPCS_ord3</th>\n",
       "      <th>address_NArw</th>\n",
       "      <th>address_UPCS_nmcm</th>\n",
       "      <th>address_UPCS_sp19_0</th>\n",
       "      <th>address_UPCS_sp19_1</th>\n",
       "      <th>address_UPCS_sp19_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32714.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>32715.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>32789.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>0</td>\n",
       "      <td>32714.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>32714.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>6</td>\n",
       "      <td>0</td>\n",
       "      <td>32789.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>32715.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8</td>\n",
       "      <td>1</td>\n",
       "      <td>32735.714844</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>7</td>\n",
       "      <td>0</td>\n",
       "      <td>32735.714844</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   address_UPCS_ord3  address_NArw  address_UPCS_nmcm  address_UPCS_sp19_0  \\\n",
       "0                  0             0       32714.000000                    1   \n",
       "1                  1             0       32715.000000                    1   \n",
       "2                  2             0       32789.000000                    1   \n",
       "3                  3             0       32714.000000                    1   \n",
       "4                  4             0       32714.000000                    0   \n",
       "5                  6             0       32789.000000                    1   \n",
       "6                  5             0       32715.000000                    0   \n",
       "7                  8             1       32735.714844                    0   \n",
       "8                  7             0       32735.714844                    0   \n",
       "\n",
       "   address_UPCS_sp19_1  address_UPCS_sp19_2  \n",
       "0                    0                    1  \n",
       "1                    0                    0  \n",
       "2                    1                    0  \n",
       "3                    0                    1  \n",
       "4                    1                    0  \n",
       "5                    1                    0  \n",
       "6                    1                    1  \n",
       "7                    0                    0  \n",
       "8                    0                    1  "
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the or23 root category to column 'address'\n",
    "assigncat = {'or23' : ['address']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['address']]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0             1234 North Peterson St Orlando, FL 32714\n",
       "1    2345 South Anderson St Altamonte Springs, FL 3...\n",
       "2            3456 South Peterson St Maitland, FL 32789\n",
       "3             4567 North Peterson St Orlando, FL 32714\n",
       "4                     5678 Avenue St Orlando, FL 32714\n",
       "5            6789 South Peterson St Maitland, FL 32789\n",
       "6      5858 North Other St Altamonte Springs, FL 32715\n",
       "7                                                 None\n",
       "8                                          Orlando, FL\n",
       "Name: address, dtype: object"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Here again is the input column that was the source for comparison\n",
    "\n",
    "df_train['address']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For additional comparison, here is the or19 version that was discussed in detail in Parsed Categoric Encodings with Automunge."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>address_NArw</th>\n",
       "      <th>address_UPCS_spl9_ord3</th>\n",
       "      <th>address_UPCS_spl9_sp10_ord3</th>\n",
       "      <th>address_UPCS_nmc7_nmbr</th>\n",
       "      <th>address_UPCS_1010_0</th>\n",
       "      <th>address_UPCS_1010_1</th>\n",
       "      <th>address_UPCS_1010_2</th>\n",
       "      <th>address_UPCS_1010_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.688760</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.657041</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.690181</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.688760</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>-0.688760</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1.690181</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.657041</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>5</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   address_NArw  address_UPCS_spl9_ord3  address_UPCS_spl9_sp10_ord3  \\\n",
       "0             0                       0                            1   \n",
       "1             0                       2                            0   \n",
       "2             0                       1                            2   \n",
       "3             0                       0                            1   \n",
       "4             0                       3                            1   \n",
       "5             0                       1                            2   \n",
       "6             0                       2                            0   \n",
       "7             1                       5                            0   \n",
       "8             0                       4                            0   \n",
       "\n",
       "   address_UPCS_nmc7_nmbr  address_UPCS_1010_0  address_UPCS_1010_1  \\\n",
       "0               -0.688760                    0                    0   \n",
       "1               -0.657041                    0                    0   \n",
       "2                1.690181                    0                    0   \n",
       "3               -0.688760                    0                    0   \n",
       "4               -0.688760                    0                    1   \n",
       "5                1.690181                    0                    1   \n",
       "6               -0.657041                    0                    1   \n",
       "7                0.000000                    1                    0   \n",
       "8                0.000000                    0                    1   \n",
       "\n",
       "   address_UPCS_1010_2  address_UPCS_1010_3  \n",
       "0                    0                    0  \n",
       "1                    0                    1  \n",
       "2                    1                    0  \n",
       "3                    1                    1  \n",
       "4                    0                    0  \n",
       "5                    1                    0  \n",
       "6                    0                    1  \n",
       "7                    0                    0  \n",
       "8                    1                    1  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the or19 root category to column 'address'\n",
    "assigncat = {'or19' : ['address']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['address']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Categoric sets may be collectively aggregated into a single common binarization.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The aggregation of categoric sets into a common binarization is a form of dimensionality reduction in the library, activated by the Binary parameter of automunge(.). Note that this only aggregates categoric encodings populated with boolean integer entries (so it is not applied to ordinal or hashed sets). Inversion is supported."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, let's demonstrate on our toy dataset without the Binary dimensionality reduction applied. We'll assign two categoric columns to the '1010' root category, which means each will be binarized individually."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_nmbr</th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_1010_0</th>\n",
       "      <th>category_1010_1</th>\n",
       "      <th>category_1010_2</th>\n",
       "      <th>binary_NArw</th>\n",
       "      <th>binary_1010_0</th>\n",
       "      <th>binary_1010_1</th>\n",
       "      <th>datetime_NArw</th>\n",
       "      <th>datetime_tmzn_year</th>\n",
       "      <th>datetime_tmzn_mdsn</th>\n",
       "      <th>datetime_tmzn_mdcs</th>\n",
       "      <th>datetime_tmzn_hmss</th>\n",
       "      <th>datetime_tmzn_hmsc</th>\n",
       "      <th>datetime_tmzn_bshr</th>\n",
       "      <th>datetime_tmzn_wkdy</th>\n",
       "      <th>datetime_tmzn_hldy</th>\n",
       "      <th>address_NArw</th>\n",
       "      <th>address_hash_0</th>\n",
       "      <th>address_hash_1</th>\n",
       "      <th>address_hash_2</th>\n",
       "      <th>address_hash_3</th>\n",
       "      <th>address_hash_4</th>\n",
       "      <th>address_hash_5</th>\n",
       "      <th>address_hash_6</th>\n",
       "      <th>address_hash_7</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.644929</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.500117</td>\n",
       "      <td>5.145554e-01</td>\n",
       "      <td>0.857457</td>\n",
       "      <td>0.258819</td>\n",
       "      <td>0.965926</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.339160</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>9.857698e-01</td>\n",
       "      <td>-0.168101</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>19</td>\n",
       "      <td>20</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.596678</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>1.224647e-16</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.442289</td>\n",
       "      <td>-0.896873</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.440556</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-8.207635e-01</td>\n",
       "      <td>-0.571268</td>\n",
       "      <td>-0.422618</td>\n",
       "      <td>-0.906308</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>32</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.255769</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-9.961947e-01</td>\n",
       "      <td>0.087156</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>20</td>\n",
       "      <td>37</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-0.641871</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>4.098203e-01</td>\n",
       "      <td>0.912166</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>29</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.501310</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>38</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-0.654195</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>24</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_nmbr  number_NArw  category_NArw  category_1010_0  category_1010_1  \\\n",
       "0    -0.644929            0              0                0                0   \n",
       "1    -0.339160            0              0                0                0   \n",
       "2     0.596678            0              0                0                0   \n",
       "3     2.440556            0              0                0                1   \n",
       "4    -0.255769            0              0                0                1   \n",
       "5    -0.641871            0              0                0                1   \n",
       "6    -0.501310            0              0                0                0   \n",
       "7    -0.654195            0              1                1                0   \n",
       "8     0.000000            1              0                0                1   \n",
       "\n",
       "   category_1010_2  binary_NArw  binary_1010_0  binary_1010_1  datetime_NArw  \\\n",
       "0                1            0              0              1              0   \n",
       "1                1            0              0              1              0   \n",
       "2                1            0              0              1              0   \n",
       "3                0            0              0              1              0   \n",
       "4                0            0              0              0              0   \n",
       "5                1            0              0              0              0   \n",
       "6                0            0              0              0              0   \n",
       "7                0            1              1              0              1   \n",
       "8                0            0              0              1              1   \n",
       "\n",
       "   datetime_tmzn_year  datetime_tmzn_mdsn  datetime_tmzn_mdcs  \\\n",
       "0           -1.500117        5.145554e-01            0.857457   \n",
       "1           -0.825064        9.857698e-01           -0.168101   \n",
       "2           -0.825064        1.224647e-16           -1.000000   \n",
       "3           -0.150012       -8.207635e-01           -0.571268   \n",
       "4           -0.150012       -9.961947e-01            0.087156   \n",
       "5           -0.150012        4.098203e-01            0.912166   \n",
       "6            1.200094       -3.489950e-02           -0.999391   \n",
       "7            1.200094       -3.489950e-02           -0.999391   \n",
       "8            1.200094       -3.489950e-02           -0.999391   \n",
       "\n",
       "   datetime_tmzn_hmss  datetime_tmzn_hmsc  datetime_tmzn_bshr  \\\n",
       "0            0.258819            0.965926                   0   \n",
       "1            0.991445           -0.130526                   0   \n",
       "2            0.442289           -0.896873                   1   \n",
       "3           -0.422618           -0.906308                   1   \n",
       "4           -0.980785            0.195090                   0   \n",
       "5            0.000000            1.000000                   0   \n",
       "6           -0.130526            0.991445                   0   \n",
       "7           -0.130526            0.991445                   0   \n",
       "8           -0.130526            0.991445                   0   \n",
       "\n",
       "   datetime_tmzn_wkdy  datetime_tmzn_hldy  address_NArw  address_hash_0  \\\n",
       "0                   0                   0             0               5   \n",
       "1                   0                   0             0              19   \n",
       "2                   1                   0             0              24   \n",
       "3                   1                   0             0              32   \n",
       "4                   1                   0             0              20   \n",
       "5                   1                   1             0              29   \n",
       "6                   1                   0             0              16   \n",
       "7                   0                   0             1              24   \n",
       "8                   0                   0             0              24   \n",
       "\n",
       "   address_hash_1  address_hash_2  address_hash_3  address_hash_4  \\\n",
       "0               4              23              19              24   \n",
       "1              20               4              19              30   \n",
       "2              20              23              19              43   \n",
       "3               4              23              19              24   \n",
       "4              37              19              24              23   \n",
       "5              20              23              19              43   \n",
       "6               4              38              19              30   \n",
       "7               0               0               0               0   \n",
       "8              23               0               0               0   \n",
       "\n",
       "   address_hash_5  address_hash_6  address_hash_7  \n",
       "0              23              16               0  \n",
       "1              31              23              41  \n",
       "2              23              12               0  \n",
       "3              23              16               0  \n",
       "4              16               0               0  \n",
       "5              23              12               0  \n",
       "6              31              23              41  \n",
       "7               0               0               0  \n",
       "8               0               0               0  "
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the 1010 root category to two categoric columns\n",
    "assigncat = {'1010' : ['category', 'binary']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False)\n",
    "\n",
    "#we'll go ahead and inspect the entire returned train set\n",
    "train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we see that between category and binary we ended up with 5 returned binarized columns (category_1010_0, category_1010_1, category_1010_2, binary_1010_0, binary_1010_1). Now let's try again with the Binary dimensionality reduction applied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_nmbr</th>\n",
       "      <th>datetime_tmzn_year</th>\n",
       "      <th>datetime_tmzn_mdsn</th>\n",
       "      <th>datetime_tmzn_mdcs</th>\n",
       "      <th>datetime_tmzn_hmss</th>\n",
       "      <th>datetime_tmzn_hmsc</th>\n",
       "      <th>address_hash_0</th>\n",
       "      <th>address_hash_1</th>\n",
       "      <th>address_hash_2</th>\n",
       "      <th>address_hash_3</th>\n",
       "      <th>address_hash_4</th>\n",
       "      <th>address_hash_5</th>\n",
       "      <th>address_hash_6</th>\n",
       "      <th>address_hash_7</th>\n",
       "      <th>Binary_1010_0</th>\n",
       "      <th>Binary_1010_1</th>\n",
       "      <th>Binary_1010_2</th>\n",
       "      <th>Binary_1010_3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.644929</td>\n",
       "      <td>-1.500117</td>\n",
       "      <td>5.145554e-01</td>\n",
       "      <td>0.857457</td>\n",
       "      <td>0.258819</td>\n",
       "      <td>0.965926</td>\n",
       "      <td>5</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>-0.339160</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>9.857698e-01</td>\n",
       "      <td>-0.168101</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>19</td>\n",
       "      <td>20</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.596678</td>\n",
       "      <td>-0.825064</td>\n",
       "      <td>1.224647e-16</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.442289</td>\n",
       "      <td>-0.896873</td>\n",
       "      <td>24</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.440556</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-8.207635e-01</td>\n",
       "      <td>-0.571268</td>\n",
       "      <td>-0.422618</td>\n",
       "      <td>-0.906308</td>\n",
       "      <td>32</td>\n",
       "      <td>4</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>-0.255769</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>-9.961947e-01</td>\n",
       "      <td>0.087156</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>20</td>\n",
       "      <td>37</td>\n",
       "      <td>19</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>16</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>-0.641871</td>\n",
       "      <td>-0.150012</td>\n",
       "      <td>4.098203e-01</td>\n",
       "      <td>0.912166</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>29</td>\n",
       "      <td>20</td>\n",
       "      <td>23</td>\n",
       "      <td>19</td>\n",
       "      <td>43</td>\n",
       "      <td>23</td>\n",
       "      <td>12</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>-0.501310</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>16</td>\n",
       "      <td>4</td>\n",
       "      <td>38</td>\n",
       "      <td>19</td>\n",
       "      <td>30</td>\n",
       "      <td>31</td>\n",
       "      <td>23</td>\n",
       "      <td>41</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-0.654195</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>24</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.200094</td>\n",
       "      <td>-3.489950e-02</td>\n",
       "      <td>-0.999391</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>24</td>\n",
       "      <td>23</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_nmbr  datetime_tmzn_year  datetime_tmzn_mdsn  datetime_tmzn_mdcs  \\\n",
       "0    -0.644929           -1.500117        5.145554e-01            0.857457   \n",
       "1    -0.339160           -0.825064        9.857698e-01           -0.168101   \n",
       "2     0.596678           -0.825064        1.224647e-16           -1.000000   \n",
       "3     2.440556           -0.150012       -8.207635e-01           -0.571268   \n",
       "4    -0.255769           -0.150012       -9.961947e-01            0.087156   \n",
       "5    -0.641871           -0.150012        4.098203e-01            0.912166   \n",
       "6    -0.501310            1.200094       -3.489950e-02           -0.999391   \n",
       "7    -0.654195            1.200094       -3.489950e-02           -0.999391   \n",
       "8     0.000000            1.200094       -3.489950e-02           -0.999391   \n",
       "\n",
       "   datetime_tmzn_hmss  datetime_tmzn_hmsc  address_hash_0  address_hash_1  \\\n",
       "0            0.258819            0.965926               5               4   \n",
       "1            0.991445           -0.130526              19              20   \n",
       "2            0.442289           -0.896873              24              20   \n",
       "3           -0.422618           -0.906308              32               4   \n",
       "4           -0.980785            0.195090              20              37   \n",
       "5            0.000000            1.000000              29              20   \n",
       "6           -0.130526            0.991445              16               4   \n",
       "7           -0.130526            0.991445              24               0   \n",
       "8           -0.130526            0.991445              24              23   \n",
       "\n",
       "   address_hash_2  address_hash_3  address_hash_4  address_hash_5  \\\n",
       "0              23              19              24              23   \n",
       "1               4              19              30              31   \n",
       "2              23              19              43              23   \n",
       "3              23              19              24              23   \n",
       "4              19              24              23              16   \n",
       "5              23              19              43              23   \n",
       "6              38              19              30              31   \n",
       "7               0               0               0               0   \n",
       "8               0               0               0               0   \n",
       "\n",
       "   address_hash_6  address_hash_7  Binary_1010_0  Binary_1010_1  \\\n",
       "0              16               0              0              0   \n",
       "1              23              41              0              0   \n",
       "2              12               0              0              0   \n",
       "3              16               0              0              1   \n",
       "4               0               0              0              0   \n",
       "5              12               0              0              1   \n",
       "6              23              41              0              0   \n",
       "7               0               0              0              1   \n",
       "8               0               0              0              1   \n",
       "\n",
       "   Binary_1010_2  Binary_1010_3  \n",
       "0              0              1  \n",
       "1              0              1  \n",
       "2              1              0  \n",
       "3              0              0  \n",
       "4              1              1  \n",
       "5              0              1  \n",
       "6              0              0  \n",
       "7              1              0  \n",
       "8              1              1  "
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the 1010 root category to two categoric columns\n",
    "assigncat = {'1010' : ['category', 'binary']}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             Binary = True,\n",
    "             MLinfill = False)\n",
    "\n",
    "#we'll go ahead and inspect the entire returned train set\n",
    "train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we see a reduction to only four returned binarized columns in a common aggregation (Binary_1010_0,\tBinary_1010_1, Binary_1010_2, Binary_1010_3). Note that the reduction in returned columns may be particularly noticable when there is a lot of correlation between categoric features."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Categoric labels may have label smoothing applied [12], or fitted smoothing where null values are fit to class distributions.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Label smoothing is available by the 'smth' root category, or fitted smoothing by the 'fsmh' root category. In each case the values for activations defaults to 0.9 and may be designated by passing a 'activation' parameter to assignparam. The null values under smth are based on the unique entry count and in fsmh are fit to class distributions corresponding to the activation associated with an entry.\n",
    "\n",
    "Note that smoothing is applied by default to train data and not to test data. To make test data smoothing on by default can apply the 'testsmooth' parameter to assignparam, or to selectively apply to test sets passed to postmunge can use the traindata postmunge(.) parameter.\n",
    "\n",
    "Here we'll first demonstrate the smth transform, as well as passing an alternate activation value to increase from default of 0.9 to 0.95."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_smth_0</th>\n",
       "      <th>category_smth_1</th>\n",
       "      <th>category_smth_2</th>\n",
       "      <th>category_smth_3</th>\n",
       "      <th>category_smth_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.9500</td>\n",
       "      <td>0.0125</td>\n",
       "      <td>0.0125</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   category_NArw  category_smth_0  category_smth_1  category_smth_2  \\\n",
       "0              0           0.0125           0.9500           0.0125   \n",
       "1              0           0.0125           0.9500           0.0125   \n",
       "2              0           0.0125           0.9500           0.0125   \n",
       "3              0           0.0125           0.0125           0.9500   \n",
       "4              0           0.0125           0.0125           0.9500   \n",
       "5              0           0.0125           0.0125           0.0125   \n",
       "6              0           0.9500           0.0125           0.0125   \n",
       "7              1           0.0125           0.0125           0.0125   \n",
       "8              0           0.0125           0.0125           0.9500   \n",
       "\n",
       "   category_smth_3  category_smth_4  \n",
       "0           0.0125           0.0125  \n",
       "1           0.0125           0.0125  \n",
       "2           0.0125           0.0125  \n",
       "3           0.0125           0.0125  \n",
       "4           0.0125           0.0125  \n",
       "5           0.9500           0.0125  \n",
       "6           0.0125           0.0125  \n",
       "7           0.0125           0.9500  \n",
       "8           0.0125           0.0125  "
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the smth root category to the category column, \n",
    "#which we'll also designate as a label set\n",
    "#(this can also be assigned to features in train set if desired)\n",
    "\n",
    "assigncat = {'smth' : ['category']}\n",
    "\n",
    "labels_column = 'category'\n",
    "\n",
    "#we'll update the activation value from 0.9 to 0.95\n",
    "assignparam = {'default_assignparam' : {'smth' : {'activation' : 0.95}}}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             labels_column = labels_column, \n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             assignparam = assignparam,\n",
    "             Binary = True,\n",
    "             MLinfill = False)\n",
    "\n",
    "#since we are interested in the labels set it will be returned in labels\n",
    "labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's try again but this time with fitted smoothing via fsmh."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_smth_0</th>\n",
       "      <th>category_smth_1</th>\n",
       "      <th>category_smth_2</th>\n",
       "      <th>category_smth_3</th>\n",
       "      <th>category_smth_4</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0.006250</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.950000</td>\n",
       "      <td>0.006250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0.950000</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.006250</td>\n",
       "      <td>0.006250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>0.006250</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.01875</td>\n",
       "      <td>0.006250</td>\n",
       "      <td>0.950000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>0</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.02500</td>\n",
       "      <td>0.95000</td>\n",
       "      <td>0.008333</td>\n",
       "      <td>0.008333</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   category_NArw  category_smth_0  category_smth_1  category_smth_2  \\\n",
       "0              0         0.008333          0.95000          0.02500   \n",
       "1              0         0.008333          0.95000          0.02500   \n",
       "2              0         0.008333          0.95000          0.02500   \n",
       "3              0         0.008333          0.02500          0.95000   \n",
       "4              0         0.008333          0.02500          0.95000   \n",
       "5              0         0.006250          0.01875          0.01875   \n",
       "6              0         0.950000          0.01875          0.01875   \n",
       "7              1         0.006250          0.01875          0.01875   \n",
       "8              0         0.008333          0.02500          0.95000   \n",
       "\n",
       "   category_smth_3  category_smth_4  \n",
       "0         0.008333         0.008333  \n",
       "1         0.008333         0.008333  \n",
       "2         0.008333         0.008333  \n",
       "3         0.008333         0.008333  \n",
       "4         0.008333         0.008333  \n",
       "5         0.950000         0.006250  \n",
       "6         0.006250         0.006250  \n",
       "7         0.006250         0.950000  \n",
       "8         0.008333         0.008333  "
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#we'll assign the smth root category to the category column, \n",
    "#which we'll also designate as a label set\n",
    "#(this can also be assigned to features in train set if desired)\n",
    "\n",
    "assigncat = {'fsmh' : ['category']}\n",
    "\n",
    "labels_column = 'category'\n",
    "\n",
    "#we'll update the activation value from 0.9 to 0.95\n",
    "assignparam = {'default_assignparam' : {'fsmh' : {'activation' : 0.95}}}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             labels_column = labels_column, \n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             assignparam = assignparam,\n",
    "             Binary = True,\n",
    "             MLinfill = False)\n",
    "\n",
    "#since we are interested in the labels set it will be returned in labels\n",
    "labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Data augmentation transformations [10] may be applied which make use of noise injection, including several variants for both numeric and categoric features.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Data augmentation transforms were discussed in the paper Numeric Encoding Options with Automunge. Root categories for numeric transforms are available as DPnb (z-score normalized), DPmm (min-max normalized), DPrt (retain normalized), and root categories for categoric transforms are available as DPbn (binary for two value set), DPod (ordinal), DPoh (one hot), DP10 (binarized). \n",
    "\n",
    "Here we'll show a numeric and a categoric to demonstrate, just to pick one how about DPmm and DP10.\n",
    "\n",
    "Note that the whole point of data augmentation is to increase the number of training samples by adding noise injection, and so the workflow is slightly different since the same set is redundantly encoded with and without noise injection. We'll accomplish this by processing the same df_train set as both train and test data in automunge and concatinating the results. Note that alternatively additional copies of the data can be prepared in postmunge.\n",
    "\n",
    "Note that the convention for DP family of transforms is that noise is injected to train data by default and not to test data. If you would like noise injected to test data, you can perform in postmunge(.) by activating the traindata parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>number_mnmx_DPmm</th>\n",
       "      <th>category_NArw</th>\n",
       "      <th>category_ord3_DPod_1010_0</th>\n",
       "      <th>category_ord3_DPod_1010_1</th>\n",
       "      <th>category_ord3_DPod_1010_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>0.002994</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.101796</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.404648</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>0.961622</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.128743</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0.003982</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0.045695</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>0.211388</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>0</td>\n",
       "      <td>0.002994</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>0</td>\n",
       "      <td>0.101796</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>0</td>\n",
       "      <td>0.404192</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>0</td>\n",
       "      <td>0.128743</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>0</td>\n",
       "      <td>0.003982</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>0</td>\n",
       "      <td>0.049401</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>1</td>\n",
       "      <td>0.211388</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    number_NArw  number_mnmx_DPmm  category_NArw  category_ord3_DPod_1010_0  \\\n",
       "0             0          0.002994              0                          0   \n",
       "1             0          0.101796              0                          0   \n",
       "2             0          0.404648              0                          0   \n",
       "3             0          0.961622              0                          1   \n",
       "4             0          0.128743              0                          0   \n",
       "5             0          0.003982              0                          0   \n",
       "6             0          0.045695              0                          0   \n",
       "7             0          0.000000              1                          1   \n",
       "8             1          0.211388              0                          0   \n",
       "9             0          0.002994              0                          0   \n",
       "10            0          0.101796              0                          0   \n",
       "11            0          0.404192              0                          0   \n",
       "12            0          1.000000              0                          0   \n",
       "13            0          0.128743              0                          0   \n",
       "14            0          0.003982              0                          0   \n",
       "15            0          0.049401              0                          0   \n",
       "16            0          0.000000              1                          1   \n",
       "17            1          0.211388              0                          0   \n",
       "\n",
       "    category_ord3_DPod_1010_1  category_ord3_DPod_1010_2  \n",
       "0                           0                          0  \n",
       "1                           0                          0  \n",
       "2                           0                          0  \n",
       "3                           0                          0  \n",
       "4                           0                          1  \n",
       "5                           1                          1  \n",
       "6                           1                          0  \n",
       "7                           0                          0  \n",
       "8                           0                          1  \n",
       "9                           0                          0  \n",
       "10                          0                          0  \n",
       "11                          0                          0  \n",
       "12                          0                          1  \n",
       "13                          0                          1  \n",
       "14                          1                          1  \n",
       "15                          1                          0  \n",
       "16                          0                          0  \n",
       "17                          0                          1  "
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Let's demonstrate applying noise injection to features 'number' and 'category'\n",
    "assigncat = {'DPmm' : ['number'],\n",
    "             'DP10' : ['category']}\n",
    "\n",
    "#these transforms accept parameters to designate noise distribution profile\n",
    "#as well as the ratio fo entries to have noise injected\n",
    "#the ratio defaults to 0.03, here we'll increased to 0.5 for visualization purposes\n",
    "assignparam = {'default_assignparam' : {'DPmm' : {'flip_prob' : 0.5},\n",
    "                                        'DP10' : {'flip_prob' : 0.5}}}\n",
    "\n",
    "#we'll process the df_train as both train data (with noise injection) \n",
    "#and test data (without noise injection)\n",
    "#and contcatinate the results to increase the number of training samples\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             df_test = df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             assigncat = assigncat,\n",
    "             assignparam = assignparam,\n",
    "             MLinfill = False)\n",
    "\n",
    "#concatinate the train and test sets to increase sample count\n",
    "train   = pd.concat([train, test], axis=0, ignore_index=True)\n",
    "trainID = pd.concat([train_ID, test_ID], axis=0, ignore_index=True)\n",
    "labels  = pd.concat([labels, test_labels], axis=0, ignore_index=True)\n",
    "\n",
    "#here is what the output looks like\n",
    "#we'll just inspect those feature with noise injected\n",
    "train[postprocess_dict['column_map']['number'] + postprocess_dict['column_map']['category']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In above dataframe, rows 0-8 have noise injected to a randomly sampled 50% of entries, and rows 9-17 are without noise injections."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Sets of transformations to be directed at a target feature can be assembled which include generations and branches of derivations by making use of our “family tree primitives” [13], as can be used to redundantly encode a feature set in multiple configurations of varying information content.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We already kind of demonstrated a family tree specification above, I'll show again here so you can see in contecxt."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number_NArw</th>\n",
       "      <th>number_log0_mnmx</th>\n",
       "      <th>number_bnep_0</th>\n",
       "      <th>number_bnep_1</th>\n",
       "      <th>number_bnep_2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0</td>\n",
       "      <td>0.665794</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0</td>\n",
       "      <td>0.868393</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0</td>\n",
       "      <td>0.700661</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>0</td>\n",
       "      <td>0.556543</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>1</td>\n",
       "      <td>0.631899</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   number_NArw  number_log0_mnmx  number_bnep_0  number_bnep_1  number_bnep_2\n",
       "0            1          0.631899              1              0              0\n",
       "1            0          0.665794              0              1              0\n",
       "2            0          0.868393              0              0              1\n",
       "3            0          1.000000              0              0              1\n",
       "4            0          0.700661              0              0              1\n",
       "5            0          0.000000              1              0              0\n",
       "6            0          0.556543              0              1              0\n",
       "7            1          0.631899              1              0              0\n",
       "8            1          0.631899              0              0              0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#family trees are defined in the transformdict\n",
    "#here the upstream transforms are newt, bnep, and NArw\n",
    "#and since parents is a primitive with offspring\n",
    "#the downstream primitives are inspected \n",
    "#and mnmx is performed downstream of newt\n",
    "transformdict =  {'newt' : {'parents' : ['newt'], \\\n",
    "                            'siblings': [], \\\n",
    "                            'auntsuncles' : ['bnep'], \\\n",
    "                            'cousins' : ['NArw'], \\\n",
    "                            'children' : [], \\\n",
    "                            'niecesnephews' : [], \\\n",
    "                            'coworkers' : ['mnmx'], \\\n",
    "                            'friends' : []}}\n",
    "\n",
    "#since we're defining a new root category 'newt' in a transformdict\n",
    "#we'll need a corresponding processdict entry\n",
    "#we'll designate that the 'newt' category\n",
    "#is associated with a log0 transform\n",
    "#as applied above in the parents primitive\n",
    "processdict = {'newt' : {'functionpointer' : 'log0'}}\n",
    "\n",
    "#we'll assign this root category to column 'number'\n",
    "assigncat = {'newt' : ['number']}\n",
    "\n",
    "#since bnep returns 5 bins and our toy data set only has 9 entries\n",
    "#let's go ahead and set a parameter to return fewer bins\n",
    "#bnep accepts parameter 'bincount' as documented in read me\n",
    "assignparam = {'default_assignparam' : {'bnep' : {'bincount' : 3}}}\n",
    "\n",
    "#Then go ahead and implement automunge\n",
    "#since this is a toy dataset we'll turn off ml infill\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             transformdict = transformdict,\n",
    "             processdict = processdict, \n",
    "             assigncat = assigncat,\n",
    "             MLinfill = False,\n",
    "             assignparam = assignparam)\n",
    "\n",
    "#and we can then inspect the returned train set to view output\n",
    "train[postprocess_dict['column_map']['number']]\n",
    "\n",
    "#note that since the root category had a processdict entry derived \n",
    "#from a functionpointer to log0\n",
    "#the associated NArowtype is applied which treats entries <=0 as subject to infill"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Such transformation sets may be accessed from those predefined in an internal library for simple assignment or alternatively may be custom configured. \""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We demonstrated with preceding custom configuration of a family tree. Those predefined in the internal library simply require assigning a column to a root category in assigncat.\n",
    "```\n",
    "assigncat = {'or19' : ['category']}\n",
    "```\n",
    "The READ ME documentation includes a comprehensive survey of transformations and transformation sets available in the library."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Even the transformation functions themselves may be custom defined with only minimal requirements of simple data structures.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The conventions for defining custom transformation functions are documented in the READ ME under section \"Custom Transformation Functions\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Through application statistics of the features are recorded to facilitate detection of distribution drift.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The drift statistics are recorded by default for train data passed to automunge. Drift reports are available for comparing subsequent test data passed to postmunge by activating the driftreport parameter. Results of the assessment can be viewed in the dictionary returned from postmunge as postreports_dict['driftreport']."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'number': {'origreturnedcolumns_list': ['number_NArw', 'number_nmbr'],\n",
       "  'newreturnedcolumns_list': ['number_NArw', 'number_nmbr'],\n",
       "  'drift_category': 'nmbr',\n",
       "  'orignotinnew': {},\n",
       "  'newnotinorig': {},\n",
       "  'newreturnedcolumn': {'number_NArw': {'orignormparam': {'pct_NArw': 0.1111111111111111},\n",
       "    'newnormparam': {'pct_NArw': 0.1111111111111111}},\n",
       "   'number_nmbr': {'orignormparam': {'mean': 69.60375,\n",
       "     'std': 107.92468600110679,\n",
       "     'max': 333.0,\n",
       "     'min': -1.0,\n",
       "     'offset': 0,\n",
       "     'multiplier': 1,\n",
       "     'cap': False,\n",
       "     'floor': False},\n",
       "    'newnormparam': {'mean': 69.60375,\n",
       "     'std': 107.92468600110679,\n",
       "     'max': 333.0,\n",
       "     'min': -1.0,\n",
       "     'offset': 0,\n",
       "     'multiplier': 1,\n",
       "     'cap': False,\n",
       "     'floor': False}}}},\n",
       " 'category': {'origreturnedcolumns_list': ['category_1010_0',\n",
       "   'category_1010_1',\n",
       "   'category_1010_2',\n",
       "   'category_NArw'],\n",
       "  'newreturnedcolumns_list': ['category_1010_0',\n",
       "   'category_1010_1',\n",
       "   'category_1010_2',\n",
       "   'category_NArw'],\n",
       "  'drift_category': '1010',\n",
       "  'orignotinnew': {},\n",
       "  'newnotinorig': {},\n",
       "  'newreturnedcolumn': {'category_1010_0': {'orignormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False},\n",
       "    'newnormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False}},\n",
       "   'category_1010_1': {'orignormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False},\n",
       "    'newnormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False}},\n",
       "   'category_1010_2': {'orignormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False},\n",
       "    'newnormparam': {'_1010_binary_encoding_dict': {1234: '000',\n",
       "      'circle': '001',\n",
       "      'square': '010',\n",
       "      'triangle': '011',\n",
       "      'zzzinfill': '100'},\n",
       "     '_1010_overlap_replace': {},\n",
       "     '_1010_binary_column_count': 3,\n",
       "     '_1010_activations_dict': {1234: 0.1111111111111111,\n",
       "      'circle': 0.3333333333333333,\n",
       "      'square': 0.3333333333333333,\n",
       "      'triangle': 0.1111111111111111,\n",
       "      'zzzinfill': 0.1111111111111111},\n",
       "     'str_convert': False}},\n",
       "   'category_NArw': {'orignormparam': {'pct_NArw': 0.1111111111111111},\n",
       "    'newnormparam': {'pct_NArw': 0.1111111111111111}}}},\n",
       " 'binary': {'origreturnedcolumns_list': ['binary_NArw', 'binary_bnry'],\n",
       "  'newreturnedcolumns_list': ['binary_NArw', 'binary_bnry'],\n",
       "  'drift_category': 'bnry',\n",
       "  'orignotinnew': {},\n",
       "  'newnotinorig': {},\n",
       "  'newreturnedcolumn': {'binary_NArw': {'orignormparam': {'pct_NArw': 0.1111111111111111},\n",
       "    'newnormparam': {'pct_NArw': 0.1111111111111111}},\n",
       "   'binary_bnry': {'orignormparam': {'missing': 'yes',\n",
       "     1: 'yes',\n",
       "     0: 'no',\n",
       "     'extravalues': [],\n",
       "     'oneratio': 0.6666666666666666,\n",
       "     'zeroratio': 0.3333333333333333,\n",
       "     'str_convert': False},\n",
       "    'newnormparam': {'missing': 'yes',\n",
       "     1: 'yes',\n",
       "     0: 'no',\n",
       "     'extravalues': [],\n",
       "     'oneratio': 0.6666666666666666,\n",
       "     'zeroratio': 0.3333333333333333,\n",
       "     'str_convert': False}}}},\n",
       " 'datetime': {'origreturnedcolumns_list': ['datetime_NArw',\n",
       "   'datetime_tmzn_bshr',\n",
       "   'datetime_tmzn_hldy',\n",
       "   'datetime_tmzn_hmsc',\n",
       "   'datetime_tmzn_hmss',\n",
       "   'datetime_tmzn_mdcs',\n",
       "   'datetime_tmzn_mdsn',\n",
       "   'datetime_tmzn_wkdy',\n",
       "   'datetime_tmzn_year'],\n",
       "  'newreturnedcolumns_list': ['datetime_NArw',\n",
       "   'datetime_tmzn',\n",
       "   'datetime_tmzn_bshr',\n",
       "   'datetime_tmzn_hldy',\n",
       "   'datetime_tmzn_hmsc',\n",
       "   'datetime_tmzn_hmss',\n",
       "   'datetime_tmzn_mdcs',\n",
       "   'datetime_tmzn_mdsn',\n",
       "   'datetime_tmzn_wkdy',\n",
       "   'datetime_tmzn_year'],\n",
       "  'drift_category': 'dat6',\n",
       "  'orignotinnew': {},\n",
       "  'newnotinorig': {'datetime_tmzn': {'newnormparam': {'timezone': False}}},\n",
       "  'newreturnedcolumn': {'datetime_NArw': {'orignormparam': {'pct_NArw': 0.2222222222222222},\n",
       "    'newnormparam': {'pct_NArw': 0.2222222222222222}},\n",
       "   'datetime_tmzn': {'orignormparam': {}, 'newnormparam': {'timezone': False}},\n",
       "   'datetime_tmzn_bshr': {'orignormparam': {'activationratio': 0.2222222222222222},\n",
       "    'newnormparam': {'activationratio': 0.2222222222222222}},\n",
       "   'datetime_tmzn_hldy': {'orignormparam': {'activationratio': 0.1111111111111111},\n",
       "    'newnormparam': {'activationratio': 0.1111111111111111}},\n",
       "   'datetime_tmzn_hmsc': {'orignormparam': {'scale': 'hourminutesecond',\n",
       "     'suffix': '_hmsc',\n",
       "     'function': 'cos',\n",
       "     'timemean': 2.7551268906481914,\n",
       "     'timemax': 6.152285613280012,\n",
       "     'timemin': 0.0,\n",
       "     'timestd': 2.301375980074111},\n",
       "    'newnormparam': {'scale': 'hourminutesecond',\n",
       "     'suffix': '_hmsc',\n",
       "     'function': 'cos',\n",
       "     'timemean': 2.7551268906481914,\n",
       "     'timemax': 6.152285613280012,\n",
       "     'timemin': 0.0,\n",
       "     'timestd': 2.301375980074111}},\n",
       "   'datetime_tmzn_hmss': {'orignormparam': {'scale': 'hourminutesecond',\n",
       "     'suffix': '_hmss',\n",
       "     'function': 'sin',\n",
       "     'timemean': 2.7551268906481914,\n",
       "     'timemax': 6.152285613280012,\n",
       "     'timemin': 0.0,\n",
       "     'timestd': 2.301375980074111},\n",
       "    'newnormparam': {'scale': 'hourminutesecond',\n",
       "     'suffix': '_hmss',\n",
       "     'function': 'sin',\n",
       "     'timemean': 2.7551268906481914,\n",
       "     'timemax': 6.152285613280012,\n",
       "     'timemin': 0.0,\n",
       "     'timestd': 2.301375980074111}},\n",
       "   'datetime_tmzn_mdcs': {'orignormparam': {'scale': 'monthday',\n",
       "     'suffix': '_mdcs',\n",
       "     'function': 'cos',\n",
       "     'timemean': 3.458245246451621,\n",
       "     'timemax': 6.705442384274988,\n",
       "     'timemin': 0.5404890586821149,\n",
       "     'timestd': 2.0169060026683727},\n",
       "    'newnormparam': {'scale': 'monthday',\n",
       "     'suffix': '_mdcs',\n",
       "     'function': 'cos',\n",
       "     'timemean': 3.458245246451621,\n",
       "     'timemax': 6.705442384274988,\n",
       "     'timemin': 0.5404890586821149,\n",
       "     'timestd': 2.0169060026683727}},\n",
       "   'datetime_tmzn_mdsn': {'orignormparam': {'scale': 'monthday',\n",
       "     'suffix': '_mdsn',\n",
       "     'function': 'sin',\n",
       "     'timemean': 3.458245246451621,\n",
       "     'timemax': 6.705442384274988,\n",
       "     'timemin': 0.5404890586821149,\n",
       "     'timestd': 2.0169060026683727},\n",
       "    'newnormparam': {'scale': 'monthday',\n",
       "     'suffix': '_mdsn',\n",
       "     'function': 'sin',\n",
       "     'timemean': 3.458245246451621,\n",
       "     'timemax': 6.705442384274988,\n",
       "     'timemin': 0.5404890586821149,\n",
       "     'timestd': 2.0169060026683727}},\n",
       "   'datetime_tmzn_wkdy': {'orignormparam': {'activationratio': 0.5555555555555556},\n",
       "    'newnormparam': {'activationratio': 0.5555555555555556}},\n",
       "   'datetime_tmzn_year': {'orignormparam': {'scale': 'year',\n",
       "     'suffix': '_year',\n",
       "     'normalization': 'zscore',\n",
       "     'scaler': 2019.2222222222222,\n",
       "     'divisor': 1.4813657362192647,\n",
       "     'timemean': 2019.2222222222222,\n",
       "     'timemax': 2021.0,\n",
       "     'timemin': 2017.0,\n",
       "     'timestd': 1.4813657362192647,\n",
       "     'maxminusmin': 4.0},\n",
       "    'newnormparam': {'scale': 'year',\n",
       "     'suffix': '_year',\n",
       "     'normalization': 'zscore',\n",
       "     'scaler': 2019.2222222222222,\n",
       "     'divisor': 1.4813657362192647,\n",
       "     'timemean': 2019.2222222222222,\n",
       "     'timemax': 2021.0,\n",
       "     'timemin': 2017.0,\n",
       "     'timestd': 1.4813657362192647,\n",
       "     'maxminusmin': 4.0}}}},\n",
       " 'address': {'origreturnedcolumns_list': ['address_NArw',\n",
       "   'address_hash_0',\n",
       "   'address_hash_1',\n",
       "   'address_hash_2',\n",
       "   'address_hash_3',\n",
       "   'address_hash_4',\n",
       "   'address_hash_5',\n",
       "   'address_hash_6',\n",
       "   'address_hash_7'],\n",
       "  'newreturnedcolumns_list': ['address_NArw',\n",
       "   'address_hash_0',\n",
       "   'address_hash_1',\n",
       "   'address_hash_2',\n",
       "   'address_hash_3',\n",
       "   'address_hash_4',\n",
       "   'address_hash_5',\n",
       "   'address_hash_6',\n",
       "   'address_hash_7'],\n",
       "  'drift_category': 'hash',\n",
       "  'orignotinnew': {},\n",
       "  'newnotinorig': {},\n",
       "  'newreturnedcolumn': {'address_NArw': {'orignormparam': {'pct_NArw': 0.1111111111111111},\n",
       "    'newnormparam': {'pct_NArw': 0.1111111111111111}},\n",
       "   'address_hash_0': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_1': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_2': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_3': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_4': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_5': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_6': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}},\n",
       "   'address_hash_7': {'orignormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'},\n",
       "    'newnormparam': {'hashcolumns': ['address_hash_0',\n",
       "      'address_hash_1',\n",
       "      'address_hash_2',\n",
       "      'address_hash_3',\n",
       "      'address_hash_4',\n",
       "      'address_hash_5',\n",
       "      'address_hash_6',\n",
       "      'address_hash_7'],\n",
       "     'col_count': 8,\n",
       "     'vocab_size_hash': 46,\n",
       "     'heuristic_multiplier': 2,\n",
       "     'heuristic_cap': 1024,\n",
       "     'max_length': 8,\n",
       "     'excluded_characters': [',', '.', '?', '!', '(', ')'],\n",
       "     'space': ' ',\n",
       "     'salt': '',\n",
       "     'max_column_count': False,\n",
       "     'hash_alg': 'hash'}}}}}"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#Here we'll generate a drift report\n",
    "#comparing the same data passed to both automunge and postmunge\n",
    "\n",
    "#first process data in automunge\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             MLinfill = False)\n",
    "\n",
    "#then process additional data in postmunge using postprocess_dict returned from automunge\n",
    "#activate the driftreport parameter to assemble drift stat comparison\n",
    "test, test_ID, test_labels, \\\n",
    "postreports_dict = \\\n",
    "am.postmunge(postprocess_dict, \n",
    "             df_train,\n",
    "             driftreport = True,\n",
    "             printstatus = False)\n",
    "\n",
    "#we can view results in the postmunge printouts or in the report returned here\n",
    "postreports_dict['driftreport']\n",
    "\n",
    "#note that the report includes drift stats associated with the received feature\n",
    "#as well as drift stats collected with each transformation function applied"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Inversion is available to recover the original form of data found preceding transformations, as may be useful to recover the original form of labels after inference.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inversion is performed in postmunge. When activating inversion, we need to designate whether the target is a test set or a set of labels.\n",
    "\n",
    "Here we'll demonstrate encoding the train set with automunge and then recovering the original form with postmunge inversion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>number</th>\n",
       "      <th>binary</th>\n",
       "      <th>category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0.000000</td>\n",
       "      <td>yes</td>\n",
       "      <td>circle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>33.000004</td>\n",
       "      <td>yes</td>\n",
       "      <td>circle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>134.000000</td>\n",
       "      <td>yes</td>\n",
       "      <td>circle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>333.000000</td>\n",
       "      <td>yes</td>\n",
       "      <td>square</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>42.000004</td>\n",
       "      <td>no</td>\n",
       "      <td>square</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.330002</td>\n",
       "      <td>no</td>\n",
       "      <td>triangle</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>15.500004</td>\n",
       "      <td>no</td>\n",
       "      <td>1234</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>-1.000000</td>\n",
       "      <td>yes</td>\n",
       "      <td>zzzinfill</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>69.603752</td>\n",
       "      <td>yes</td>\n",
       "      <td>square</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       number binary   category\n",
       "0    0.000000    yes     circle\n",
       "1   33.000004    yes     circle\n",
       "2  134.000000    yes     circle\n",
       "3  333.000000    yes     square\n",
       "4   42.000004     no     square\n",
       "5    0.330002     no   triangle\n",
       "6   15.500004     no       1234\n",
       "7   -1.000000    yes  zzzinfill\n",
       "8   69.603752    yes     square"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#first process data in automunge\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(df_train,\n",
    "             shuffletrain = False,\n",
    "             printstatus = False,\n",
    "             MLinfill = False)\n",
    "\n",
    "#then we'll invert the train set returned from automunge\n",
    "df_invert, recovered_list, inversion_info_dict = \\\n",
    "am.postmunge(postprocess_dict, \n",
    "             train, \n",
    "             inversion='test', \n",
    "             printstatus=False)\n",
    "\n",
    "#Here is the recovered data after inversion\n",
    "#(Note that by convention an arbitrary plug value of 'zzzinfill' is applied\n",
    "#to entries that were not successfully recovered.)\n",
    "\n",
    "#note that a few transforms don't support inversion,\n",
    "#such as transforms for datetime sets or hashing\n",
    "#(so columns 'datetime' and 'address' aren't recovered)\n",
    "#for details on which columns are recovered, turn on printouts with printstatus = True\n",
    "\n",
    "df_invert"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\"Or of course if the data is received already numerically encoded the library can simply be applied as a tool for missing data infill.\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since our toy data set isn't already numerically encoded, we'll just note the parameter. In this mode, a numeric set with any non-integer entries is treated as a continuous set for regression. An all-integer set is treated as a categoric encoding for classification, unless the number of unique entries in the integer set exceeds a heuristic threshold of 75% of the number of rows in the train data, in which case it is treated as a continuous integer set for regression. Missing data is assumed to be received as NaN.\n",
    "\n",
    "This heuristic is not perfect. There may be cases where an integer set that would be treated as a classification target should instead be treated as a continuous set for regression, in which case we suggest assigning the column to exc8 in assigncat.\n",
    "\n",
    "The associated automunge parameter is:\n",
    "```\n",
    "powertransform = 'infill'\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
