{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Embedding CPLEX in a ML Spark Pipeline\n",
    "`Spark ML` provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.\n",
    "\n",
    "In this notebook, we show how to embed CPLEX as a Spark _transformer_ class.\n",
    "DOcplex provides transformer classes that take a matrix `X` of constraints and a vector `y` of costs and solves a linear problem using CPLEX.\n",
    "\n",
    "Transformer classes share a `solve(X, Y, **params)` method which expects:\n",
    " - an X matrix containing the constraints of the linear problem\n",
    " - a Y vector containing the cost coefficients.\n",
    " \n",
    "The transformer classes requires a Spark DataFrame for the 'X' matrix, and support various formats for the 'Y' vector:\n",
    "\n",
    "- Python lists,\n",
    "- numpy vector,\n",
    "- pandas Series,\n",
    "- Spark columns\n",
    "\n",
    "The same formats are also supported to optionally specify upper bounds for decision variables.\n",
    "\n",
    " \n",
    "## DOcplex transformer classes\n",
    "\n",
    "There are two DOcplex transformer classes:\n",
    "\n",
    " - __$CplexTransformer$__ expects to solve a linear problem in the classical form: \n",
    " \n",
    " $$ minimize\\  C^{t} x\\\\ s.t.\\\\ \n",
    " Ax <= B$$\n",
    " \n",
    " Where $A$ is a (M,N) matrix describing the constraints and $B$ is a scalar vector of size M, containing the _right hand sides_ of the constraints, and $C$ is the _cost vector_ of size N. In this case the transformer expects a (M,N+1) matrix, where the last column contains the right hand sides.\n",
    " \n",
    " - __$CplexRangeTransformer$__ expects to solve linear problem as a set of _range_ constraints:\n",
    " \n",
    "  $$ minimize\\  C^{t} x\\\\ s.t.\\\\ \n",
    " m <= Ax <= M$$\n",
    " \n",
    "  Where $A$ is a (M,N) matrix describing the constraints, $m$ and $M$ are two scalar vectors of size M, containing the _minimum_ and _maximum_ values for the row expressions, and $C$ is the _cost vector_ of size N. In this case the transformer expects a (M,N+2) matrix, where the last two columns contains the minimum and maximum values (in this order).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import numpy as np\n",
    "except ImportError:\n",
    "    raise RuntimError('This notebook requires numpy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "In the next section we illustrate the range transformer with the Diet Problem, from DOcplex distributed examples.\n",
    "\n",
    "\n",
    "## The Diet Problem\n",
    "\n",
    "The diet problem is delivered in the DOcplex examples.\n",
    "\n",
    "Given a breakdown matrix of various foods in elementary nutrients, plus limitations on quantities for foods an nutrients, and food costs, the goal is to find the optimal quantity for each food for a balanced diet.\n",
    "\n",
    "The __FOOD_NUTRIENTS__ data intentionally contains a missing value ($np.nan$) to illustrate the use of a pipeline involving a data cleansing stage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# the baseline diet data as Python lists of tuples.\n",
    "FOODS = [\n",
    "    (\"Roasted Chicken\", 0.84, 0, 10),\n",
    "    (\"Spaghetti W/ Sauce\", 0.78, 0, 10),\n",
    "    (\"Tomato,Red,Ripe,Raw\", 0.27, 0, 10),\n",
    "    (\"Apple,Raw,W/Skin\", .24, 0, 10),\n",
    "    (\"Grapes\", 0.32, 0, 10),\n",
    "    (\"Chocolate Chip Cookies\", 0.03, 0, 10),\n",
    "    (\"Lowfat Milk\", 0.23, 0, 10),\n",
    "    (\"Raisin Brn\", 0.34, 0, 10),\n",
    "    (\"Hotdog\", 0.31, 0, 10)\n",
    "]\n",
    "\n",
    "NUTRIENTS = [\n",
    "    (\"Calories\", 2000, 2500),\n",
    "    (\"Calcium\", 800, 1600),\n",
    "    (\"Iron\", 10, 30),\n",
    "    (\"Vit_A\", 5000, 50000),\n",
    "    (\"Dietary_Fiber\", 25, 100),\n",
    "    (\"Carbohydrates\", 0, 300),\n",
    "    (\"Protein\", 50, 100)\n",
    "]\n",
    "\n",
    "FOOD_NUTRIENTS = [\n",
    "    (\"Roasted Chicken\", 277.4, 21.9, 1.8, 77.4, 0.0, 0.0, 42.2),\n",
    "    (\"Spaghetti W/ Sauce\", 358.2, 80.2, 2.3, 3055.2, 11.6, 58.3, 8.2),\n",
    "    (\"Tomato,Red,Ripe,Raw\", 25.8, 6.2, 0.6, 766.3, 1.4, 5.7, 1.0),\n",
    "    (\"Apple,Raw,W/Skin\", 81.4, 9.7, 0.2, 73.1, 3.7, 21.0, 0.3),\n",
    "    (\"Grapes\", 15.1, 3.4, 0.1, 24.0, 0.2, 4.1, 0.2),\n",
    "    (\"Chocolate Chip Cookies\", 78.1, 6.2, 0.4, 101.8, 0.0, 9.3, 0.9),\n",
    "    (\"Lowfat Milk\", 121.2, 296.7, 0.1, 500.2, 0.0, 11.7, 8.1),\n",
    "    (\"Raisin Brn\", 115.1, 12.9, 16.8, 1250.2, 4.0, 27.9, 4.0),\n",
    "    (\"Hotdog\", 242.1, 23.5, 2.3, 0.0, 0.0, 18.0, 10.4)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nb_foods = len(FOODS)\n",
    "nb_nutrients = len(NUTRIENTS)\n",
    "print('#foods={0}'.format(nb_foods))\n",
    "print('#nutrients={0}'.format(nb_nutrients))\n",
    "\n",
    "assert nb_foods == len(FOOD_NUTRIENTS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a Spark session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "try:\n",
    "    import findspark\n",
    "    findspark.init()\n",
    "except ImportError:\n",
    "    # Ignore exception: the 'findspark' module is required when executing Spark in a Windows environment\n",
    "    pass\n",
    "\n",
    "import pyspark  # Only run after findspark.init() (if running in a Windows environment)\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder.getOrCreate()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the transformer with a Spark dataframe\n",
    "\n",
    "In this section we show how to use a transformer with data stored in a Spark dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the data as a numpy matrix\n",
    "\n",
    "In this section we build a numpy matrix to be passed to the transformer.\n",
    "\n",
    "First, we extract the food to nutrient matrix by stripping the names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mat_fn = np.matrix([FOOD_NUTRIENTS[f][1:] for f in range(nb_foods)])\n",
    "print('The food-nutrient matrix has shape: {0}'.format(mat_fn.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we extract the two vectors of min/max for each nutrient. Each vector has nb_nutrients elements.\n",
    "We also break the `FOODS` collection of tuples into columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nutrient_mins = [NUTRIENTS[n][1] for n in range(nb_nutrients)]\n",
    "nutrient_maxs = [NUTRIENTS[n][2] for n in range(nb_nutrients)]\n",
    "\n",
    "food_names ,food_costs, food_mins, food_maxs = map(list, zip(*FOODS))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to prepare the transformer matrix. This matrix has shape (7, 11) as we\n",
    "have 7 nutrients and 9 foods, plus the additional `min` and `max` columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# step 1. add two lines for nutrient mins, maxs\n",
    "nf2 = np.append(mat_fn, np.matrix([nutrient_mins, nutrient_maxs]), axis=0)\n",
    "mat_nf = nf2.transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mat_nf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Populate a Spark dataframe with the matrix data\n",
    "\n",
    "In this section we build a Spark dataframe matrix to be passed to the transformer.\n",
    "\n",
    "Using a Spark dataframe will also allow us to chain multiple transformers in a pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SQLContext\n",
    "\n",
    "sc = spark.sparkContext\n",
    "sqlContext = SQLContext(sc)\n",
    "\n",
    "columns = food_names + ['min', 'max']\n",
    "food_nutrients_df = sqlContext.createDataFrame(mat_nf.tolist(), columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's display the dataframe schema and content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "food_nutrients_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "food_nutrients_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Solving the Diet problem with the $CplexRangeTransformer$ in a Pipeline\n",
    "\n",
    "To use the transformer, create an instance and pass the following parameters to the `transform` method\n",
    " - the `X` matrix of size(M, N+2) containing coefficients for N column variables plus two addition column for range mins and maxs.\n",
    " - the `Y` cost vector (using __\"y\"__ parameter id)\n",
    " - whether one wants to solve a minimization (`min`) or maximization (`max`) problem (using __\"sense\"__ parameter id)\n",
    " \n",
    "In addition, some data elements that can't be encoded in the matrix itself should be passed as keyword arguments:\n",
    "\n",
    "- `ubs` denotes the upper bound for the column variables that are created. The expected size of this scalar vector is N (when matrix has size (M,N+2))\n",
    "- `minCol` and `maxCol` are the names of the columns corresponding to the constraints min and max range in the `X` matrix"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docplex.mp.sparktrans.transformers import CplexRangeTransformer\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.sql.functions import *\n",
    "\n",
    "# Create the optimization transformer to calculate the optimal quantity for each food for a balanced diet.\n",
    "cplexSolve = CplexRangeTransformer(minCol='min', maxCol='max', ubs=food_maxs)\n",
    "\n",
    "# Make evaluation on input data. Additional parameters are specified using the 'params' dictionary\n",
    "diet_df = cplexSolve.transform(food_nutrients_df, params={cplexSolve.y: food_costs, cplexSolve.sense: 'min'})\n",
    "diet_df.orderBy(desc(\"value\")).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example with CplexTransformer\n",
    "\n",
    "To illustrate the usage of the __$CplexTransformer$__, let's remove the constraint on the minimum amount for nutrients, and reformulate the problem as a cost maximization.\n",
    "\n",
    "First, let's define a new dataframe for the constraints matrix by removing the `min` column from the `food_nutrients_df` dataframe so that it is a well-formed input matrix for the __$CplexTransformer$__:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "food_nutrients_LP_df = food_nutrients_df.select([item for item in food_nutrients_df.columns if item not in ['min']])\n",
    "food_nutrients_LP_df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from docplex.mp.sparktrans.transformers import CplexTransformer\n",
    "\n",
    "# Create the optimization transformer to calculate the optimal quantity for each food for a balanced diet.\n",
    "# Here, let's use the CplexTransformer by specifying only a maximum amount for each nutrient.\n",
    "cplexSolve = CplexTransformer(rhsCol='max', ubs=food_maxs)\n",
    "\n",
    "# Make evaluation on input data. Additional parameters are specified using the 'params' dictionary\n",
    "# Since there is no lower range for decision variables, let's maximize cost instead! (otherwise, the result is all 0's)\n",
    "diet_max_cost_df = cplexSolve.transform(food_nutrients_LP_df, params={cplexSolve.y: food_costs, cplexSolve.sense: 'max'})\n",
    "diet_max_cost_df.orderBy(desc(\"value\")).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_radar_chart(labels, stats, **kwargs):\n",
    "    angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)\n",
    "    # close the plot\n",
    "    stats  = np.concatenate((stats, [stats[0]]))\n",
    "    angles = np.concatenate((angles, [angles[0]]))\n",
    "    fig = plt.figure()\n",
    "    ax = fig.add_subplot(111, polar=True)\n",
    "    ax.plot(angles, stats, 'o-', linewidth=2, **kwargs)\n",
    "    ax.fill(angles, stats, alpha=0.30, **kwargs)\n",
    "    ax.set_thetagrids(angles * 180/np.pi, labels)\n",
    "    #ax.set_title([df.loc[386,\"Name\"]])\n",
    "    ax.grid(True)\n",
    "    \n",
    "diet = diet_df.toPandas()\n",
    "plot_radar_chart(labels=diet['name'], stats=diet['value'], color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diet_max_cost = diet_max_cost_df.toPandas()\n",
    "plot_radar_chart(labels=diet_max_cost['name'], stats=diet_max_cost['value'], color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
