{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Examples of Transformations\n",
    "\n",
    "### An Automunge Demonstration"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Automunge is available now for pip install:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install Automunge"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or to upgrade (we currently roll out upgrades pretty frequently):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install Automunge --upgrade"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once installed, run this in a local session to initialize:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from Automunge import *\n",
    "am = AutoMunge()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Automunge is a platform for preparing tabular data for machine learning, where tabular data refers to two-dimensional tables of feature set columns and sampled rows, such as may be provided in the form of a pandas dataframe or numpy array.\n",
    "\n",
    "The library has two master functions:\n",
    "- automunge(.) for the initial preparation of \"training\" data\n",
    "- postmunge(.) for subsequent preparation of \"test\" data on a consistent basis\n",
    "\n",
    "Here, training data refers to data used to train a machine learning model, and test data refers to corresponding data used to generate predictions from that model.\n",
    "\n",
    "The preparations performed by these functions include numerical encodings and missing data infill, and may also include feature engineering transformations, either sourced from an extensive built-in library or custom defined, making raw data suitable for the direct application of machine learning.\n",
    "\n",
    "Feature engineering transformations, and in some cases sets of transformations which may include generations and branches of derivations, may be applied to a distinct column in the received set. In some cases, received feature sets may be returned in multiple configurations.\n",
    "\n",
    "When the automunge(.) function is applied to a received tabular training set, in addition to applying data transformations, the function populates and returns a dictionary containing all of the steps and parameters of transformations, which may then be passed to the postmunge(.) function with a corresponding tabular test set for consistent processing on the training set basis.\n",
    "\n",
    "For our demonstrations, we'll set a few parameters to make visualization easier:\n",
    "- we'll turn off data set shuffling with the shuffletrain parameter\n",
    "- we'll turn off printouts with the printstatus parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To demonstrate the types of data transformations available, let's create a toy data set with a few representative categories of data. We'll create a toy training set which includes a labels column and a toy test set without labels included."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "toy_df_train = \\\n",
    "pd.DataFrame({'numbers':[1.0, 5.2, -3, 0, 11, np.nan], \\\n",
    "              'strings':['square', 'square', 'circle', 'circle', 'triangle', np.nan], \\\n",
    "              'time'   :['1/01/2017 1:00am', \n",
    "                         '3/10/2018 6:30am', \n",
    "                         '5/31/2018 10:15am', \n",
    "                         '7/26/2019 1:40pm', \n",
    "                         '9/05/2019 6:45pm', \n",
    "                         '12/25/2019'], \\\n",
    "              'labels' :['yes', 'yes', 'yes', 'no', 'no', 'yes'],\n",
    "             })\n",
    "\n",
    "\n",
    "toy_df_test = \\\n",
    "pd.DataFrame({'numbers':[-2, 3.4, 0, 10, np.nan, 5], \\\n",
    "              'strings':['circle', 'square', 'square', 'triangle', np.nan, 'circle'], \\\n",
    "              'time'   :['3/03/2018 4:00am', \n",
    "                         '7/10/2017 6:30pm', \n",
    "                         '11/12/2018 6:15am', \n",
    "                         '4/13/2019 12:40pm', \n",
    "                         '8/25/2017 6:45pm', \n",
    "                         '1/02/2018'],\n",
    "             })\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>strings</th>\n",
       "      <th>time</th>\n",
       "      <th>labels</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>square</td>\n",
       "      <td>1/01/2017 1:00am</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>square</td>\n",
       "      <td>3/10/2018 6:30am</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>circle</td>\n",
       "      <td>5/31/2018 10:15am</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>circle</td>\n",
       "      <td>7/26/2019 1:40pm</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>triangle</td>\n",
       "      <td>9/05/2019 6:45pm</td>\n",
       "      <td>no</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>12/25/2019</td>\n",
       "      <td>yes</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers   strings               time labels\n",
       "0      1.0    square   1/01/2017 1:00am    yes\n",
       "1      5.2    square   3/10/2018 6:30am    yes\n",
       "2     -3.0    circle  5/31/2018 10:15am    yes\n",
       "3      0.0    circle   7/26/2019 1:40pm     no\n",
       "4     11.0  triangle   9/05/2019 6:45pm     no\n",
       "5      NaN       NaN         12/25/2019    yes"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "toy_df_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>strings</th>\n",
       "      <th>time</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-2.0</td>\n",
       "      <td>circle</td>\n",
       "      <td>3/03/2018 4:00am</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3.4</td>\n",
       "      <td>square</td>\n",
       "      <td>7/10/2017 6:30pm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>0.0</td>\n",
       "      <td>square</td>\n",
       "      <td>11/12/2018 6:15am</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>10.0</td>\n",
       "      <td>triangle</td>\n",
       "      <td>4/13/2019 12:40pm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8/25/2017 6:45pm</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5.0</td>\n",
       "      <td>circle</td>\n",
       "      <td>1/02/2018</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers   strings               time\n",
       "0     -2.0    circle   3/03/2018 4:00am\n",
       "1      3.4    square   7/10/2017 6:30pm\n",
       "2      0.0    square  11/12/2018 6:15am\n",
       "3     10.0  triangle  4/13/2019 12:40pm\n",
       "4      NaN       NaN   8/25/2017 6:45pm\n",
       "5      5.0    circle          1/02/2018"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "toy_df_test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our demonstrations will illustrate different types of data transformations that may be applied to these toy data sets. Let's start with numerical encodings under automation.\n",
    "\n",
    "When we run the automunge(.) function, it returns a series of 10 sets. The names assigned to these returned sets in the demonstrations follow an optional convention.\n",
    "\n",
    "- train: processed training data\n",
    "- train_ID: index column and carveout columns corresponding to train data\n",
    "- labels: processed label sets corresponding to training data\n",
    "- val, val_ID, val_labels: validation sets carved out from training data\n",
    "- test, test_ID, test_labels: processed test data if a test set was passed to the function\n",
    "- postprocess_dict: populated dictionary for consistently preparing additional data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers_nmbr</th>\n",
       "      <th>numbers_NArw</th>\n",
       "      <th>strings_NArw</th>\n",
       "      <th>strings_1010_0</th>\n",
       "      <th>strings_1010_1</th>\n",
       "      <th>time_NArw</th>\n",
       "      <th>time_tmzn_year</th>\n",
       "      <th>time_tmzn_mdsn</th>\n",
       "      <th>time_tmzn_mdcs</th>\n",
       "      <th>time_tmzn_hmss</th>\n",
       "      <th>time_tmzn_hmsc</th>\n",
       "      <th>time_tmzn_bshr</th>\n",
       "      <th>time_tmzn_wkdy</th>\n",
       "      <th>time_tmzn_hldy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.379221</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.632993</td>\n",
       "      <td>5.145554e-01</td>\n",
       "      <td>0.857457</td>\n",
       "      <td>0.258819</td>\n",
       "      <td>0.965926</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.486392</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>9.857698e-01</td>\n",
       "      <td>-0.168101</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-1.203615</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>1.224647e-16</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.442289</td>\n",
       "      <td>-0.896873</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>-0.585319</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>-8.207635e-01</td>\n",
       "      <td>-0.571268</td>\n",
       "      <td>-0.422618</td>\n",
       "      <td>-0.906308</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1.681763</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>-9.961947e-01</td>\n",
       "      <td>0.087156</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.380870</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>4.098203e-01</td>\n",
       "      <td>0.912166</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers_nmbr  numbers_NArw  strings_NArw  strings_1010_0  strings_1010_1  \\\n",
       "0     -0.379221             0             0               0               1   \n",
       "1      0.486392             0             0               0               1   \n",
       "2     -1.203615             0             0               0               0   \n",
       "3     -0.585319             0             0               0               0   \n",
       "4      1.681763             0             0               1               0   \n",
       "5      0.380870             1             1               0               0   \n",
       "\n",
       "   time_NArw  time_tmzn_year  time_tmzn_mdsn  time_tmzn_mdcs  time_tmzn_hmss  \\\n",
       "0          0       -1.632993    5.145554e-01        0.857457        0.258819   \n",
       "1          0       -0.408248    9.857698e-01       -0.168101        0.991445   \n",
       "2          0       -0.408248    1.224647e-16       -1.000000        0.442289   \n",
       "3          0        0.816497   -8.207635e-01       -0.571268       -0.422618   \n",
       "4          0        0.816497   -9.961947e-01        0.087156       -0.980785   \n",
       "5          0        0.816497    4.098203e-01        0.912166        0.000000   \n",
       "\n",
       "   time_tmzn_hmsc  time_tmzn_bshr  time_tmzn_wkdy  time_tmzn_hldy  \n",
       "0        0.965926               0               0               0  \n",
       "1       -0.130526               0               0               0  \n",
       "2       -0.896873               1               1               0  \n",
       "3       -0.906308               1               1               0  \n",
       "4        0.195090               0               1               0  \n",
       "5        1.000000               0               1               1  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict = \\\n",
    "am.automunge(toy_df_train,\n",
    "             labels_column = 'labels',\n",
    "             shuffletrain = False,\n",
    "             printstatus = False)\n",
    "\n",
    "train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that if this training data is to be used to train a model, the returned postprocess_dict dictionary should be externally saved, such as with the pickle library."
   ]
  },
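  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of that step, the cell below saves and reloads a dictionary with pickle. The helper names and the file name postprocess_dict.pickle are illustrative choices rather than anything prescribed by Automunge, and a small stand-in dictionary is used so the cell runs on its own. In this notebook one would pass the postprocess_dict returned above instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import pickle\n",
    "import tempfile\n",
    "\n",
    "def save_dict(obj, path):\n",
    "    # Serialize the dictionary to disk in binary mode\n",
    "    with open(path, 'wb') as f:\n",
    "        pickle.dump(obj, f)\n",
    "\n",
    "def load_dict(path):\n",
    "    # Restore the dictionary in a later session\n",
    "    with open(path, 'rb') as f:\n",
    "        return pickle.load(f)\n",
    "\n",
    "# Stand-in for the dictionary returned by automunge(.)\n",
    "stand_in_dict = {'column_map': {'numbers': ['numbers_nmbr', 'numbers_NArw']}}\n",
    "\n",
    "# Hypothetical file name, written to a temporary folder for this demo\n",
    "path = os.path.join(tempfile.gettempdir(), 'postprocess_dict.pickle')\n",
    "save_dict(stand_in_dict, path)\n",
    "restored = load_dict(path)\n",
    "\n",
    "restored == stand_in_dict"
   ]
  },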
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the dictionary returned from the automunge(.) call, we can then consistently process test data, returning the same type and order of columns on a basis derived from the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers_nmbr</th>\n",
       "      <th>numbers_NArw</th>\n",
       "      <th>strings_NArw</th>\n",
       "      <th>strings_1010_0</th>\n",
       "      <th>strings_1010_1</th>\n",
       "      <th>time_NArw</th>\n",
       "      <th>time_tmzn_year</th>\n",
       "      <th>time_tmzn_mdsn</th>\n",
       "      <th>time_tmzn_mdcs</th>\n",
       "      <th>time_tmzn_hmss</th>\n",
       "      <th>time_tmzn_hmsc</th>\n",
       "      <th>time_tmzn_bshr</th>\n",
       "      <th>time_tmzn_wkdy</th>\n",
       "      <th>time_tmzn_hldy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>-0.997516</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>0.998717</td>\n",
       "      <td>-0.050649</td>\n",
       "      <td>0.866025</td>\n",
       "      <td>0.500000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>0.115415</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.632993</td>\n",
       "      <td>-0.638465</td>\n",
       "      <td>-0.769651</td>\n",
       "      <td>-0.991445</td>\n",
       "      <td>0.130526</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-0.585319</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>-0.309017</td>\n",
       "      <td>0.951057</td>\n",
       "      <td>0.997859</td>\n",
       "      <td>-0.065403</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1.475665</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>0.731354</td>\n",
       "      <td>-0.681998</td>\n",
       "      <td>-0.173648</td>\n",
       "      <td>-0.984808</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>0.985975</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.632993</td>\n",
       "      <td>-0.994869</td>\n",
       "      <td>-0.101168</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>0.445173</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>0.528964</td>\n",
       "      <td>0.848644</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers_nmbr  numbers_NArw  strings_NArw  strings_1010_0  strings_1010_1  \\\n",
       "0     -0.997516             0             0               0               0   \n",
       "1      0.115415             0             0               0               1   \n",
       "2     -0.585319             0             0               0               1   \n",
       "3      1.475665             0             0               1               0   \n",
       "4      0.985975             1             1               0               0   \n",
       "5      0.445173             0             0               0               0   \n",
       "\n",
       "   time_NArw  time_tmzn_year  time_tmzn_mdsn  time_tmzn_mdcs  time_tmzn_hmss  \\\n",
       "0          0       -0.408248        0.998717       -0.050649        0.866025   \n",
       "1          0       -1.632993       -0.638465       -0.769651       -0.991445   \n",
       "2          0       -0.408248       -0.309017        0.951057        0.997859   \n",
       "3          0        0.816497        0.731354       -0.681998       -0.173648   \n",
       "4          0       -1.632993       -0.994869       -0.101168       -0.980785   \n",
       "5          0       -0.408248        0.528964        0.848644        0.000000   \n",
       "\n",
       "   time_tmzn_hmsc  time_tmzn_bshr  time_tmzn_wkdy  time_tmzn_hldy  \n",
       "0        0.500000               0               0               0  \n",
       "1        0.130526               0               1               0  \n",
       "2       -0.065403               0               1               1  \n",
       "3       -0.984808               1               0               0  \n",
       "4        0.195090               0               1               0  \n",
       "5        1.000000               0               1               0  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test, test_ID, test_labels, \\\n",
    "postreports_dict = \\\n",
    "am.postmunge(postprocess_dict, \n",
    "             toy_df_test, \n",
    "             printstatus = False)\n",
    "\n",
    "test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ___________________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# numeric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's inspect each of the returned sets in detail. For visualization we concatenate an input column with the corresponding returned set. The returned columns have the same header as the input column, with an appended suffix identifying the transformation that was applied."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>numbers_nmbr</th>\n",
       "      <th>numbers_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>-0.379221</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>0.486392</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>-1.203615</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>-0.585319</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>1.681763</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.380870</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers  numbers_nmbr  numbers_NArw\n",
       "0      1.0     -0.379221             0\n",
       "1      5.2      0.486392             0\n",
       "2     -3.0     -1.203615             0\n",
       "3      0.0     -0.585319             0\n",
       "4     11.0      1.681763             0\n",
       "5      NaN      0.380870             1"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputcolumn = 'numbers'\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here we see that the input column 'numbers', containing numerical data, defaulted to a z-score normalization in which the data was centered and scaled to a mean of 0 and standard deviation of 1. The default missing data infill for this transform is imputation with the set's mean."
   ]
  },
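  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the normalization concrete, here is a rough check in plain Python rather than a view into the library's internals. It assumes the population standard deviation (dividing by N), which reproduces the numbers_nmbr values displayed above for the rows with valid entries. The missing entry is left as NaN and is not part of the check."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import statistics\n",
    "\n",
    "numbers = [1.0, 5.2, -3, 0, 11, float('nan')]\n",
    "\n",
    "# Fit the mean and standard deviation on valid entries only\n",
    "valid = [x for x in numbers if not math.isnan(x)]\n",
    "mean = statistics.fmean(valid)\n",
    "std = statistics.pstdev(valid)  # population standard deviation (assumption)\n",
    "\n",
    "# Apply the z-score to valid entries, leaving the missing entry as NaN\n",
    "zscores = [(x - mean) / std if not math.isnan(x) else x for x in numbers]\n",
    "\n",
    "[round(z, 6) for z in zscores]"
   ]
  },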
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# categoric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The default for categoric data applies a numeric encoding by way of a binary transform, in which distinct categoric values may be represented by zero, one, or more simultaneous activations in the returned set. The convention for missing data infill is a distinct set of activations."
   ]
  },
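  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As an illustration of the binary encoding idea (a sketch, not the library's implementation), the cell below assigns each category an integer and writes that integer in binary digits. Ordering the categories alphabetically is an assumption; it happens to reproduce the strings_1010 columns shown in the train output above for the rows with valid entries. Missing entries are not handled here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categories = ['square', 'square', 'circle', 'circle', 'triangle']\n",
    "\n",
    "# Assign an integer to each unique category (alphabetical order is an assumption)\n",
    "unique = sorted(set(categories))\n",
    "index = {cat: i for i, cat in enumerate(unique)}\n",
    "\n",
    "# Number of binary columns needed to distinguish the unique categories\n",
    "width = max(1, (len(unique) - 1).bit_length())\n",
    "\n",
    "def binary_encode(cat):\n",
    "    # Write the integer as fixed-width binary digits, most significant first\n",
    "    return [int(bit) for bit in format(index[cat], 'b').zfill(width)]\n",
    "\n",
    "{cat: binary_encode(cat) for cat in unique}"
   ]
  },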
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>strings</th>\n",
       "      <th>strings_NArw</th>\n",
       "      <th>strings_1010_0</th>\n",
       "      <th>strings_1010_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>triangle</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    strings  strings_NArw  strings_1010_0  strings_1010_1\n",
       "0    square             0               0               1\n",
       "1    square             0               0               1\n",
       "2    circle             0               0               0\n",
       "3    circle             0               0               0\n",
       "4  triangle             0               1               0\n",
       "5       NaN             1               0               0"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputcolumn = 'strings'\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# datetime"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The datetime data has some more extensive transformations. Here the time stamps are segregated by time scale and subjected to both sin and cos transforms to allow a model to recognize periodicity. We also by default aggregate bin markers to designate time stamps that fall within business hours, weekdays, and holidays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>time</th>\n",
       "      <th>time_NArw</th>\n",
       "      <th>time_tmzn_year</th>\n",
       "      <th>time_tmzn_mdsn</th>\n",
       "      <th>time_tmzn_mdcs</th>\n",
       "      <th>time_tmzn_hmss</th>\n",
       "      <th>time_tmzn_hmsc</th>\n",
       "      <th>time_tmzn_bshr</th>\n",
       "      <th>time_tmzn_wkdy</th>\n",
       "      <th>time_tmzn_hldy</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1/01/2017 1:00am</td>\n",
       "      <td>0</td>\n",
       "      <td>-1.632993</td>\n",
       "      <td>5.145554e-01</td>\n",
       "      <td>0.857457</td>\n",
       "      <td>0.258819</td>\n",
       "      <td>0.965926</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>3/10/2018 6:30am</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>9.857698e-01</td>\n",
       "      <td>-0.168101</td>\n",
       "      <td>0.991445</td>\n",
       "      <td>-0.130526</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>5/31/2018 10:15am</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.408248</td>\n",
       "      <td>1.224647e-16</td>\n",
       "      <td>-1.000000</td>\n",
       "      <td>0.442289</td>\n",
       "      <td>-0.896873</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>7/26/2019 1:40pm</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>-8.207635e-01</td>\n",
       "      <td>-0.571268</td>\n",
       "      <td>-0.422618</td>\n",
       "      <td>-0.906308</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>9/05/2019 6:45pm</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>-9.961947e-01</td>\n",
       "      <td>0.087156</td>\n",
       "      <td>-0.980785</td>\n",
       "      <td>0.195090</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>12/25/2019</td>\n",
       "      <td>0</td>\n",
       "      <td>0.816497</td>\n",
       "      <td>4.098203e-01</td>\n",
       "      <td>0.912166</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                time  time_NArw  time_tmzn_year  time_tmzn_mdsn  \\\n",
       "0   1/01/2017 1:00am          0       -1.632993    5.145554e-01   \n",
       "1   3/10/2018 6:30am          0       -0.408248    9.857698e-01   \n",
       "2  5/31/2018 10:15am          0       -0.408248    1.224647e-16   \n",
       "3   7/26/2019 1:40pm          0        0.816497   -8.207635e-01   \n",
       "4   9/05/2019 6:45pm          0        0.816497   -9.961947e-01   \n",
       "5         12/25/2019          0        0.816497    4.098203e-01   \n",
       "\n",
       "   time_tmzn_mdcs  time_tmzn_hmss  time_tmzn_hmsc  time_tmzn_bshr  \\\n",
       "0        0.857457        0.258819        0.965926               0   \n",
       "1       -0.168101        0.991445       -0.130526               0   \n",
       "2       -1.000000        0.442289       -0.896873               1   \n",
       "3       -0.571268       -0.422618       -0.906308               1   \n",
       "4        0.087156       -0.980785        0.195090               0   \n",
       "5        0.912166        0.000000        1.000000               0   \n",
       "\n",
       "   time_tmzn_wkdy  time_tmzn_hldy  \n",
       "0               0               0  \n",
       "1               0               0  \n",
       "2               1               0  \n",
       "3               1               0  \n",
       "4               1               0  \n",
       "5               1               1  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputcolumn = 'time'\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
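  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough sketch of the periodic encoding (inferred from the values returned above rather than taken from the library's documentation), the hour-minute-second columns are consistent with mapping the time of day onto a 24 hour cycle, where $h$, $m$, and $s$ are the hour, minute, and second of the time stamp:\n",
    "\n",
    "$$\\text{hmss} = \\sin\\left(2\\pi \\, \\frac{h + m/60 + s/3600}{24}\\right), \\qquad \\text{hmsc} = \\cos\\left(2\\pi \\, \\frac{h + m/60 + s/3600}{24}\\right)$$\n",
    "\n",
    "For example, 1:00am gives $\\sin(2\\pi/24) \\approx 0.258819$ and $\\cos(2\\pi/24) \\approx 0.965926$, matching the first row. Returning both sin and cos gives each point in the cycle a unique pair of values, and keeps times just before and just after midnight close together."
   ]
  },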
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our label set for categoric labels defaults to ordinal encoding, a different type of categoric encoding than the binary demonstrated above. In ordinal encoding, each distinct categoric value is associated with integer activations returned in a single column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>labels</th>\n",
       "      <th>labels_ordl</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>yes</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>yes</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>yes</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>no</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>no</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>yes</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  labels  labels_ordl\n",
       "0    yes            1\n",
       "1    yes            1\n",
       "2    yes            1\n",
       "3     no            0\n",
       "4     no            0\n",
       "5    yes            1"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "inputcolumn = 'labels'\n",
    "\n",
    "#single column dataframes are returned as series, such as here with labels set\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           labels], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ___________________"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# assigned transforms"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A user does not have to defer to defaults. Automunge includes an extensive library of data transformations and data transformation sets that can be easily applied to a column. Let's demonstrate a few alternatives.\n",
    "\n",
    "Assigning transformations to a column is performed by passing a dictionary called the assigncat."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# normalizations"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A few different types of normalizations are available. Here for example is a min-max scaling, which has more of a known range of output than the z-score normalization demonstrated above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>numbers_mnmx</th>\n",
       "      <th>numbers_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.285714</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>0.585714</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.214286</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.555286</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers  numbers_mnmx  numbers_NArw\n",
       "0      1.0      0.285714             0\n",
       "1      5.2      0.585714             0\n",
       "2     -3.0      0.000000             0\n",
       "3      0.0      0.214286             0\n",
       "4     11.0      1.000000             0\n",
       "5      NaN      0.555286             1"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'mnmx'\n",
    "inputcolumn     = 'numbers'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
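  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the scaling, consistent with the values returned above, where the minimum and maximum are taken over the training column (here $-3$ and $11$):\n",
    "\n",
    "$$x_{\\text{mnmx}} = \\frac{x - \\min(x)}{\\max(x) - \\min(x)}$$\n",
    "\n",
    "For example, $x = 1$ gives $(1 + 3)/14 \\approx 0.285714$ and $x = 5.2$ gives $8.2/14 \\approx 0.585714$. The entry in the row flagged by numbers_NArw is an infilled value rather than a scaled input."
   ]
  },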
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another type of normalization available in the library retains the sign of the received data after scaling, such as may benefit interpretibility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>numbers_retn</th>\n",
       "      <th>numbers_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.071429</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>0.371429</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>-0.214286</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>0.785714</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.357000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers  numbers_retn  numbers_NArw\n",
       "0      1.0      0.071429             0\n",
       "1      5.2      0.371429             0\n",
       "2     -3.0     -0.214286             0\n",
       "3      0.0      0.000000             0\n",
       "4     11.0      0.785714             0\n",
       "5      NaN      0.357000             1"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'retn'\n",
    "inputcolumn     = 'numbers'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
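  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this column, which spans both negative and positive values, the returned values are consistent with dividing by the range without shifting, so that zero stays at zero and signs are preserved. This is a sketch inferred from the output above, and columns that are entirely positive or entirely negative may be treated differently:\n",
    "\n",
    "$$x_{\\text{retn}} = \\frac{x}{\\max(x) - \\min(x)}$$\n",
    "\n",
    "For example, $x = -3$ gives $-3/14 \\approx -0.214286$ and $x = 11$ gives $11/14 \\approx 0.785714$."
   ]
  },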
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Normalizations are also available to be applied in sets, such as to present a received column in multiple configurations to a training operation. Here is an example in which min-max normalization is supplemented by a set of aggregated bins indicating number of standard deviations from the mean."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>numbers_mnmx</th>\n",
       "      <th>numbers_bins_0</th>\n",
       "      <th>numbers_bins_1</th>\n",
       "      <th>numbers_bins_2</th>\n",
       "      <th>numbers_bins_3</th>\n",
       "      <th>numbers_bins_4</th>\n",
       "      <th>numbers_bins_5</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0.285714</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>0.585714</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.214286</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>1.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>0.503429</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers  numbers_mnmx  numbers_bins_0  numbers_bins_1  numbers_bins_2  \\\n",
       "0      1.0      0.285714               0               0               1   \n",
       "1      5.2      0.585714               0               0               0   \n",
       "2     -3.0      0.000000               0               1               0   \n",
       "3      0.0      0.214286               0               0               1   \n",
       "4     11.0      1.000000               0               0               0   \n",
       "5      NaN      0.503429               0               0               0   \n",
       "\n",
       "   numbers_bins_3  numbers_bins_4  numbers_bins_5  \n",
       "0               0               0               0  \n",
       "1               1               0               0  \n",
       "2               0               0               0  \n",
       "3               0               0               0  \n",
       "4               0               1               0  \n",
       "5               0               0               0  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'mnm7'\n",
    "inputcolumn     = 'numbers'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
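  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "One reading of the bin columns, inferred from the output above and assuming the sample standard deviation is used, is that each entry is binned by its z-score, with $\\mu = 2.84$ and $\\sigma \\approx 5.42$ for this column:\n",
    "\n",
    "$$z = \\frac{x - \\mu}{\\sigma}$$\n",
    "\n",
    "- numbers_bins_0: $z$ below $-2$\n",
    "- numbers_bins_1: $z$ from $-2$ to $-1$\n",
    "- numbers_bins_2: $z$ from $-1$ to $0$\n",
    "- numbers_bins_3: $z$ from $0$ to $1$\n",
    "- numbers_bins_4: $z$ from $1$ to $2$\n",
    "- numbers_bins_5: $z$ above $2$\n",
    "\n",
    "For example, $x = 11$ gives $z \\approx 1.50$ and activates numbers_bins_4, while $x = -3$ gives $z \\approx -1.08$ and activates numbers_bins_1. Which side of each boundary is inclusive is not shown by this example, and the row with missing data has no bin activated."
   ]
  },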
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# categoric "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Similarily, there are several options available for categoric sets, let's demonstrate a few. First, again the defaults for categoric are binary encodings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>strings</th>\n",
       "      <th>strings_NArw</th>\n",
       "      <th>strings_1010_0</th>\n",
       "      <th>strings_1010_1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>triangle</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    strings  strings_NArw  strings_1010_0  strings_1010_1\n",
       "0    square             0               0               1\n",
       "1    square             0               0               1\n",
       "2    circle             0               0               0\n",
       "3    circle             0               0               0\n",
       "4  triangle             0               1               0\n",
       "5       NaN             1               0               0"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = '1010'\n",
    "inputcolumn     = 'strings'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the defaults themselves are configurable."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is one-hot encoding, in which each category has a unique column for activations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>strings</th>\n",
       "      <th>strings_NArw</th>\n",
       "      <th>strings_circle</th>\n",
       "      <th>strings_square</th>\n",
       "      <th>strings_triangle</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>square</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>triangle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    strings  strings_NArw  strings_circle  strings_square  strings_triangle\n",
       "0    square             0               0               1                 0\n",
       "1    square             0               0               1                 0\n",
       "2    circle             0               1               0                 0\n",
       "3    circle             0               1               0                 0\n",
       "4  triangle             0               0               0                 1\n",
       "5       NaN             1               0               0                 0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'text'\n",
    "inputcolumn     = 'strings'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another option is ordinal integer encodings, which is applied as a default when the number of unique values exceeds a configuratble heuristic threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>strings</th>\n",
       "      <th>strings_ord3</th>\n",
       "      <th>strings_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>square</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>square</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>triangle</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    strings  strings_ord3  strings_NArw\n",
       "0    square             1             0\n",
       "1    square             1             0\n",
       "2    circle             0             0\n",
       "3    circle             0             0\n",
       "4  triangle             2             0\n",
       "5       NaN             1             1"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'ord3'\n",
    "inputcolumn     = 'strings'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course it's not enough to just numerically encode training data, we need to apply a consistent basis to test data, here demonstrated for the postmunge function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>strings</th>\n",
       "      <th>strings_ord3</th>\n",
       "      <th>strings_NArw</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>square</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>square</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>triangle</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>circle</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    strings  strings_ord3  strings_NArw\n",
       "0    circle             0             0\n",
       "1    square             1             0\n",
       "2    square             1             0\n",
       "3  triangle             2             0\n",
       "4       NaN             2             1\n",
       "5    circle             0             0"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test, test_ID, test_labels, \\\n",
    "postreports_dict = \\\n",
    "am.postmunge(postprocess_dict, \n",
    "             toy_df_test,\n",
    "             printstatus = False)\n",
    "\n",
    "visualization = pd.concat([toy_df_test[inputcolumn], \\\n",
    "                           test[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# transformation sets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In some cases, we may desire to present our feature sets to machine learning in multiple configurations, such as for example if we are not sure of interpretation. In Automunge sets of transformations can be specified for application to a target column with a simple set of \"family tree\" primitives defined for a root category. Let's demonstrate here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Custom defined family trees of transformations can be passed to a function call by way of the transformdict and processdict parameters. The transformdict parameter is used to populate a family tree, and the processdict parameter is to populate a supporting data structure for defining transformation category properties.\n",
    "\n",
    "Let's demonstrate a scenario to assemble a transformation set in which a normalization is supplemented by two types of bin aggregation. We'll create a new root category 'newt' and populate with transformations pre-defined in the library. Here we'll apply an upstream retain normalization and power of ten bins, and a standard deviation bins downstream of the retain normalization. We'll also include a NArw transformation which designates markers for entries that were subject to infill based on values of the source column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "transformdict =  {'newt' : {'parents'       : ['newt'],\n",
    "                            'siblings'      : [],\n",
    "                            'auntsuncles'   : ['pwr2'],\n",
    "                            'cousins'       : ['NArw'],\n",
    "                            'children'      : [],\n",
    "                            'niecesnephews' : [],\n",
    "                            'coworkers'     : [],\n",
    "                            'friends'       : ['bins']}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The corresponding processdict will make use of transformation functions defined in the library for the retain normalization. Here NArowtype designates the types of entries from the source column that will be targets for infill, MLinfilltype designates the types of predictive models to be trained for ML infill, and labelctgy is a support entry for feature importance for cases where a label is returned in multiple configurations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "processdict    =  {'newt' : {'functionpointer' : 'retn', \n",
    "                             'NArowtype'       : 'numeric',\n",
    "                             'MLinfilltype'    : 'numeric',\n",
    "                             'labelctgy'       : 'newt'}}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can then pass these populated structures to a function call and assign a column to the newly defined root category 'newt'. If we want we can also apply ML infill even on this custom defined transformation set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>numbers_NArw</th>\n",
       "      <th>numbers_retn</th>\n",
       "      <th>numbers_retn_bins_0</th>\n",
       "      <th>numbers_retn_bins_1</th>\n",
       "      <th>numbers_retn_bins_2</th>\n",
       "      <th>numbers_retn_bins_3</th>\n",
       "      <th>numbers_retn_bins_4</th>\n",
       "      <th>numbers_retn_bins_5</th>\n",
       "      <th>numbers_-10^0</th>\n",
       "      <th>numbers_10^0</th>\n",
       "      <th>numbers_10^1</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.071429</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>5.2</td>\n",
       "      <td>0</td>\n",
       "      <td>0.371429</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>-3.0</td>\n",
       "      <td>0</td>\n",
       "      <td>-0.214286</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>11.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.785714</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1</td>\n",
       "      <td>0.315857</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers  numbers_NArw  numbers_retn  numbers_retn_bins_0  \\\n",
       "0      1.0             0      0.071429                    0   \n",
       "1      5.2             0      0.371429                    0   \n",
       "2     -3.0             0     -0.214286                    0   \n",
       "3      0.0             0      0.000000                    0   \n",
       "4     11.0             0      0.785714                    0   \n",
       "5      NaN             1      0.315857                    0   \n",
       "\n",
       "   numbers_retn_bins_1  numbers_retn_bins_2  numbers_retn_bins_3  \\\n",
       "0                    0                    1                    0   \n",
       "1                    0                    0                    1   \n",
       "2                    1                    0                    0   \n",
       "3                    0                    1                    0   \n",
       "4                    0                    0                    0   \n",
       "5                    0                    0                    0   \n",
       "\n",
       "   numbers_retn_bins_4  numbers_retn_bins_5  numbers_-10^0  numbers_10^0  \\\n",
       "0                    0                    0              0             1   \n",
       "1                    0                    0              0             1   \n",
       "2                    0                    0              1             0   \n",
       "3                    0                    0              0             0   \n",
       "4                    1                    0              0             0   \n",
       "5                    0                    0              0             0   \n",
       "\n",
       "   numbers_10^1  \n",
       "0             0  \n",
       "1             0  \n",
       "2             0  \n",
       "3             0  \n",
       "4             1  \n",
       "5             0  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "target_category = 'newt'\n",
    "inputcolumn     = 'numbers'\n",
    "\n",
    "assigncat = {target_category : inputcolumn}\n",
    "\n",
    "train, train_ID, labels, \\\n",
    "val, val_ID, val_labels, \\\n",
    "test, test_ID, test_labels, \\\n",
    "postprocess_dict \\\n",
    "= am.automunge(toy_df_train,\n",
    "               labels_column = 'labels',\n",
    "               shuffletrain = False,\n",
    "               assigncat = assigncat,\n",
    "               transformdict = transformdict,\n",
    "               processdict = processdict,\n",
    "               printstatus = False)\n",
    "\n",
    "\n",
    "\n",
    "visualization = pd.concat([toy_df_train[inputcolumn], \\\n",
    "                           train[postprocess_dict['column_map'][inputcolumn]]], \\\n",
    "                           axis=1)\n",
    "                           \n",
    "visualization"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
