{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from IPython.core.display import display, HTML\n",
    "display(HTML(\"<style>.container { width:100% !important; }</style>\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Embedding CPLEX in a ML Spark Pipeline\n",
    "`Spark ML` provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.\n",
    "\n",
    "In this notebook, we show how to embed CPLEX as a Spark _transformer_ class.\n",
    "DOcplex provides transformer classes that take a matrix `X` of constraints and a vector `y` of costs and solves a linear problem using CPLEX.\n",
    "\n",
    "Transformer classes share a `solve(X, Y, **params)` method which expects:\n",
    " - an X matrix containing the constraints of the linear problem\n",
    " - a Y vector containing the cost coefficients.\n",
    " \n",
    "The transformer classes requires a Spark DataFrame for the 'X' matrix, and support various formats for the 'Y' vector:\n",
    "\n",
    "- Python lists,\n",
    "- numpy vector,\n",
    "- pandas Series,\n",
    "- Spark columns\n",
    "\n",
    "The same formats are also supported to optionally specify upper bounds for decision variables.\n",
    "\n",
    " \n",
    "## DOcplex transformer classes\n",
    "\n",
    "There are two DOcplex transformer classes:\n",
    "\n",
    " - __$CplexTransformer$__ expects to solve a linear problem in the classical form: \n",
    " \n",
    " $$ minimize\\  C^{t} x\\\\ s.t.\\\\ \n",
    " Ax <= B$$\n",
    " \n",
    " Where $A$ is a (M,N) matrix describing the constraints and $B$ is a scalar vector of size M, containing the _right hand sides_ of the constraints, and $C$ is the _cost vector_ of size N. In this case the transformer expects a (M,N+1) matrix, where the last column contains the right hand sides.\n",
    " \n",
    " - __$CplexRangeTransformer$__ expects to solve linear problem as a set of _range_ constraints:\n",
    " \n",
    "  $$ minimize\\  C^{t} x\\\\ s.t.\\\\ \n",
    " m <= Ax <= M$$\n",
    " \n",
    "  Where $A$ is a (M,N) matrix describing the constraints, $m$ and $M$ are two scalar vectors of size M, containing the _minimum_ and _maximum_ values for the row expressions, and $C$ is the _cost vector_ of size N. In this case the transformer expects a (M,N+2) matrix, where the last two columns contains the minimum and maximum values (in this order).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import numpy as np\n",
    "except ImportError:\n",
    "    raise RuntimError('This notebook requires numpy')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "In the next section we illustrate the range transformer with the Diet Problem, from DOcplex distributed examples.\n",
    "\n",
    "\n",
    "## The Diet Problem\n",
    "\n",
    "The diet problem is delivered in the DOcplex examples.\n",
    "\n",
    "Given a breakdown matrix of various foods in elementary nutrients, plus limitations on quantities for foods an nutrients, and food costs, the goal is to find the optimal quantity for each food for a balanced diet.\n",
    "\n",
    "The __FOOD_NUTRIENTS__ data intentionally contains a missing value ($np.nan$) to illustrate the use of a pipeline involving a data cleansing stage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# the baseline diet data as Python lists of tuples.\n",
    "FOODS = [\n",
    "    (\"Roasted Chicken\", 0.84, 0, 10),\n",
    "    (\"Spaghetti W/ Sauce\", 0.78, 0, 10),\n",
    "    (\"Tomato,Red,Ripe,Raw\", 0.27, 0, 10),\n",
    "    (\"Apple,Raw,W/Skin\", .24, 0, 10),\n",
    "    (\"Grapes\", 0.32, 0, 10),\n",
    "    (\"Chocolate Chip Cookies\", 0.03, 0, 10),\n",
    "    (\"Lowfat Milk\", 0.23, 0, 10),\n",
    "    (\"Raisin Brn\", 0.34, 0, 10),\n",
    "    (\"Hotdog\", 0.31, 0, 10)\n",
    "]\n",
    "\n",
    "NUTRIENTS = [\n",
    "    (\"Calories\", 2000, 2500),\n",
    "    (\"Calcium\", 800, 1600),\n",
    "    (\"Iron\", 10, 30),\n",
    "    (\"Vit_A\", 5000, 50000),\n",
    "    (\"Dietary_Fiber\", 25, 100),\n",
    "    (\"Carbohydrates\", 0, 300),\n",
    "    (\"Protein\", 50, 100)\n",
    "]\n",
    "\n",
    "FOOD_NUTRIENTS = [\n",
    "#     (\"Roasted Chicken\", 277.4, 21.9, 1.8, 77.4, 0.0, 0.0, 42.2),\n",
    "    (\"Roasted Chicken\", 277.4, 21.9, 1.8, np.nan, 0.0, 0.0, 42.2),        # Set a value as missing (NaN)\n",
    "    (\"Spaghetti W/ Sauce\", 358.2, 80.2, 2.3, 3055.2, 11.6, 58.3, 8.2),\n",
    "    (\"Tomato,Red,Ripe,Raw\", 25.8, 6.2, 0.6, 766.3, 1.4, 5.7, 1.0),\n",
    "    (\"Apple,Raw,W/Skin\", 81.4, 9.7, 0.2, 73.1, 3.7, 21.0, 0.3),\n",
    "    (\"Grapes\", 15.1, 3.4, 0.1, 24.0, 0.2, 4.1, 0.2),\n",
    "    (\"Chocolate Chip Cookies\", 78.1, 6.2, 0.4, 101.8, 0.0, 9.3, 0.9),\n",
    "    (\"Lowfat Milk\", 121.2, 296.7, 0.1, 500.2, 0.0, 11.7, 8.1),\n",
    "    (\"Raisin Brn\", 115.1, 12.9, 16.8, 1250.2, 4.0, 27.9, 4.0),\n",
    "    (\"Hotdog\", 242.1, 23.5, 2.3, 0.0, 0.0, 18.0, 10.4)\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "nb_foods = len(FOODS)\n",
    "nb_nutrients = len(NUTRIENTS)\n",
    "print('#foods={0}'.format(nb_foods))\n",
    "print('#nutrients={0}'.format(nb_nutrients))\n",
    "\n",
    "assert nb_foods == len(FOOD_NUTRIENTS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a Spark session"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import findspark\n",
    "    findspark.init()\n",
    "except ImportError:\n",
    "    # Ignore exception: the 'findspark' module is required when executing Spark in a Windows environment\n",
    "    pass\n",
    "\n",
    "import pyspark  # Only run after findspark.init() (if running in a Windows environment)\n",
    "from pyspark.sql import SparkSession\n",
    "\n",
    "spark = SparkSession.builder.getOrCreate()\n",
    "\n",
    "if spark.version < '2.2':\n",
    "    raise \"This notebook requires at least version '2.2' for PySpark\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using the transformer with a Spark dataframe\n",
    "\n",
    "In this section we show how to use a transformer with data stored in a Spark dataframe."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prepare the data as a numpy matrix\n",
    "\n",
    "In this section we build a numpy matrix to be passed to the transformer.\n",
    "\n",
    "First, we extract the food to nutrient matrix by stripping the names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mat_fn = np.matrix([FOOD_NUTRIENTS[f][1:] for f in range(nb_foods)])\n",
    "print('The food-nutrient matrix has shape: {0}'.format(mat_fn.shape))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we extract the two vectors of min/max for each nutrient. Each vector has nb_nutrients elements.\n",
    "We also break the `FOODS` collection of tuples into columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "nutrient_mins = [NUTRIENTS[n][1] for n in range(nb_nutrients)]\n",
    "nutrient_maxs = [NUTRIENTS[n][2] for n in range(nb_nutrients)]\n",
    "\n",
    "food_names ,food_costs, food_mins, food_maxs = map(list, zip(*FOODS))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are now ready to prepare the transformer matrix. This matrix has shape (7, 11) as we\n",
    "have 7 nutrients and 9 foods, plus the additional `min` and `max` columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# step 1. add two lines for nutrient mins, maxs\n",
    "nf2 = np.append(mat_fn, np.matrix([nutrient_mins, nutrient_maxs]), axis=0)\n",
    "mat_nf = nf2.transpose()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "mat_nf.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Populate a Spark dataframe with the matrix data\n",
    "\n",
    "In this section we build a Spark dataframe matrix to be passed to the transformer.\n",
    "\n",
    "Using a Spark dataframe will also allow us to chain multiple transformers in a pipeline."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from pyspark.sql import SQLContext\n",
    "\n",
    "sc = spark.sparkContext\n",
    "sqlContext = SQLContext(sc)\n",
    "\n",
    "columns = food_names + ['min', 'max']\n",
    "food_nutrients_df = sqlContext.createDataFrame(mat_nf.tolist(), columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's display the dataframe schema and content"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "food_nutrients_df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "food_nutrients_df.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Chaining a data cleansing stage with the $CplexRangeTransformer$ in a Pipeline\n",
    "\n",
    "To use the transformer, create an instance and pass the following parameters to the `transform` method\n",
    " - the `X` matrix of size(M, N+2) containing coefficients for N column variables plus two addition column for range mins and maxs.\n",
    " - the `Y` cost vector (using __\"y\"__ parameter id)\n",
    " - whether one wants to solve a minimization (`min`) or maximization (`max`) problem (using __\"sense\"__ parameter id)\n",
    " \n",
    "In addition, some data elements that can't be encoded in the matrix itself should be passed as keyword arguments:\n",
    "\n",
    "- `ubs` denotes the upper bound for the column variables that are created. The expected size of this scalar vector is N (when matrix has size (M,N+2))\n",
    "- `minCol` and `maxCol` are the names of the columns corresponding to the constraints min and max range in the `X` matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the input data contains some missing values, we'll actually define a pipeline that will:\n",
    "- first, perform a data cleansing stage: here missing values are replaced by the column mean value\n",
    "- then, perform the optimization stage: the Cplex transformer will be invoked using the output dataframe from the cleansing stage as the constraint matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from docplex.mp.sparktrans.transformers import CplexRangeTransformer\n",
    "from pyspark.ml.feature import Imputer\n",
    "from pyspark.ml import Pipeline\n",
    "from pyspark.sql.functions import *\n",
    "\n",
    "# Create a data cleansing stage to replace missing values with column mean value\n",
    "data_cleansing = Imputer(inputCols=food_names, outputCols=food_names) \n",
    "\n",
    "# Create an optimization stage to calculate the optimal quantity for each food for a balanced diet.\n",
    "cplexSolve = CplexRangeTransformer(minCol='min', maxCol='max', ubs=food_maxs)\n",
    "\n",
    "# Configure an ML pipeline, which chains these two stages\n",
    "pipeline = Pipeline(stages=[data_cleansing, cplexSolve])\n",
    "\n",
    "# Fit the pipeline: during this step, the data cleansing estimator is configured\n",
    "model = pipeline.fit(food_nutrients_df)\n",
    "\n",
    "# Make evaluation on input data. One can still specify stage-specific parameters when invoking 'transform' on the PipelineModel\n",
    "diet_df = model.transform(food_nutrients_df, params={cplexSolve.y: food_costs, cplexSolve.sense: 'min'})\n",
    "diet_df.orderBy(desc(\"value\")).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just for checking purpose, let's have a look at the Spark dataframe at the output of the cleansing stage.<br>\n",
    "This is the dataframe that is fed to the __$CplexRangeTransformer$__ in the pipeline.<br>\n",
    "One can check that the first entry in the fourth row has been set to the average of the other values in the same column ($57.2167$)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data_cleansing.fit(food_nutrients_df).transform(food_nutrients_df).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example with CplexTransformer\n",
    "\n",
    "To illustrate the usage of the __$CplexTransformer$__, let's remove the constraint on the minimum amount for nutrients, and reformulate the problem as a cost maximization.\n",
    "\n",
    "First, let's define a new dataframe for the constraints matrix by removing the `min` column from the `food_nutrients_df` dataframe so that it is a well-formed input matrix for the __$CplexTransformer$__:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "food_nutrients_LP_df = food_nutrients_df.select([item for item in food_nutrients_df.columns if item not in ['min']])\n",
    "food_nutrients_LP_df.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from docplex.mp.sparktrans.transformers import CplexTransformer\n",
    "\n",
    "# Create a data cleansing stage to replace missing values with column mean value\n",
    "data_cleansing = Imputer(inputCols=food_names, outputCols=food_names) \n",
    "\n",
    "# Create an optimization stage to calculate the optimal quantity for each food for a balanced diet.\n",
    "# Here, let's use the CplexTransformer by specifying only a maximum amount for each nutrient.\n",
    "cplexSolve = CplexTransformer(rhsCol='max', ubs=food_maxs)\n",
    "\n",
    "# Configure an ML pipeline, which chains these two stages\n",
    "pipeline = Pipeline(stages=[data_cleansing, cplexSolve])\n",
    "\n",
    "# Fit the pipeline: during this step, the data cleansing estimator is configured\n",
    "model = pipeline.fit(food_nutrients_LP_df)\n",
    "\n",
    "# Make evaluation on input data. One can still specify stage-specific parameters when invoking 'transform' on the PipelineModel\n",
    "# Since there is no lower range for decision variables, let's maximize cost instead! (otherwise, the result is all 0's)\n",
    "diet_max_cost_df = model.transform(food_nutrients_LP_df, params={cplexSolve.y: food_costs, cplexSolve.sense: 'max'})\n",
    "diet_max_cost_df.orderBy(desc(\"value\")).show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def plot_radar_chart(labels, stats, **kwargs):\n",
    "    angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)\n",
    "    # close the plot\n",
    "    stats  = np.concatenate((stats, [stats[0]]))\n",
    "    angles = np.concatenate((angles, [angles[0]]))\n",
    "    fig = plt.figure()\n",
    "    ax = fig.add_subplot(111, polar=True)\n",
    "    ax.plot(angles, stats, 'o-', linewidth=2, **kwargs)\n",
    "    ax.fill(angles, stats, alpha=0.30, **kwargs)\n",
    "    ax.set_thetagrids(angles * 180/np.pi, labels)\n",
    "    #ax.set_title([df.loc[386,\"Name\"]])\n",
    "    ax.grid(True)\n",
    "    \n",
    "diet = diet_df.toPandas()\n",
    "plot_radar_chart(labels=diet['name'], stats=diet['value'], color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "diet_max_cost = diet_max_cost_df.toPandas()\n",
    "plot_radar_chart(labels=diet_max_cost['name'], stats=diet_max_cost['value'], color='r')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
