{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Kernel time series prediction\n",
    "\n",
    "This notebook generates reference results on all tree data sets using _Time Series Prediction for Graphs in Kernel and Dissimilarity Spaces_ ([Paaßen, Göpfert, and Hammer, 2018](https://arxiv.org/abs/1704.06498)). This method performs a prediction in kernel space as follows. Let $\\{(x_i, y_i)\\}_{i \\in \\{1, \\ldots, m\\}}$ be the training data where $y_i$ is the true successor of $x_i$. Then, the prediction $\\phi(y)$ in kernel space for the input tree $x$ is given as\n",
    "\n",
    "$$\\phi(y) = \\phi(x) + \\sum_i \\gamma_i \\cdot [ \\phi(y_i) - \\phi(x_i) ]$$\n",
    "\n",
    "where $\\vec \\gamma = \\big( K + \\sigma^2 \\cdot I \\big)^{-1} \\cdot \\vec k(x)$ are the coefficients computed by Gaussian process/kernel prediction, where $K$ is the kernel matrix between all training data and where $\\vec k(x)$ is the vector of kernel values from $x$ to all training trees. The kernel we use here is $k(x, y) = \\exp\\big(-\\frac{1}{2 \\psi^2} \\cdot d(x, y)^2 \\big)$, where $d$ is the tree edit distance and $\\psi$ is a hyper-parameter called bandwidth. This is the suggested kernel of ([Paaßen, Göpfert, and Hammer, 2018](https://arxiv.org/abs/1704.06498)).\n",
    "\n",
    "To map back from kernel space to a tree we employ the scheme recommended by the authors in ([Paaßen et al., 2018](https://jedm.educationaldatamining.org/index.php/JEDM/article/view/158)). In particular, we greedily apply edits $\\delta$ to $x$ that move towards training data trees $y_i$ with positive coefficients $\\gamma_i$ or towards training data trees $x_i$ with negative coefficients $\\gamma_i$ and which reduce the loss\n",
    "\n",
    "$$\\ell(\\delta) = d(x, \\delta(x))^2 + \\sum_i \\gamma_i [ d(\\delta(x), y_i)^2 - d(\\delta(x), x_i)^2) ]$$\n",
    "\n",
    "until no edit $\\delta$ is found anymore which reduces the loss.\n",
    "\n",
    "For even more details, please refer to the source code at `kernel_time_series_prediction.py`."
   ]
  },
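  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loss $\\ell(\\delta)$ used for the back-mapping can be read as a squared distance to the kernel-space prediction. The following derivation is a sketch under an assumption that this notebook does not state explicitly: that $\\phi$ is an embedding with $d(a, b)^2 = \\lVert \\phi(a) - \\phi(b) \\rVert^2$ for all trees $a$ and $b$, where $\\lVert \\cdot \\rVert^2$ is induced by a symmetric bilinear form $\\langle \\cdot, \\cdot \\rangle$. The symbols $z$, $C_1$, and $C$ are introduced here for illustration only. Writing $z = \\delta(x)$ and expanding the square around the prediction $\\phi(y)$ gives\n",
    "\n",
    "$$\\lVert \\phi(z) - \\phi(y) \\rVert^2 = d(x, z)^2 - 2 \\sum_i \\gamma_i \\cdot \\langle \\phi(z) - \\phi(x), \\phi(y_i) - \\phi(x_i) \\rangle + C_1$$\n",
    "\n",
    "where $C_1 = \\lVert \\sum_i \\gamma_i \\cdot [ \\phi(y_i) - \\phi(x_i) ] \\rVert^2$ does not depend on $z$. Each inner product can be expressed via squared distances only:\n",
    "\n",
    "$$\\langle \\phi(z) - \\phi(x), \\phi(y_i) - \\phi(x_i) \\rangle = \\frac{1}{2} \\cdot [ d(z, x_i)^2 + d(x, y_i)^2 - d(z, y_i)^2 - d(x, x_i)^2 ]$$\n",
    "\n",
    "Substituting this identity yields\n",
    "\n",
    "$$\\lVert \\phi(z) - \\phi(y) \\rVert^2 = d(x, z)^2 + \\sum_i \\gamma_i \\cdot [ d(z, y_i)^2 - d(z, x_i)^2 ] + C$$\n",
    "\n",
    "with $C = C_1 + \\sum_i \\gamma_i \\cdot [ d(x, x_i)^2 - d(x, y_i)^2 ]$, which is again independent of $z$. Under the stated assumption, the right-hand side equals $\\ell(\\delta) + C$, so choosing the edit that reduces $\\ell(\\delta)$ most is the same as moving $\\delta(x)$ as close as possible to the prediction $\\phi(y)$."
   ]
  },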
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Boolean Formulae and Peano Addition\n",
    "\n",
    "Refer to `boolean_formulae.py` and `peano_additon.py` for details on the datasets."
   ]
  },
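  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two limiting cases of the coefficient formula $\\vec \\gamma = \\big( K + \\sigma^2 \\cdot I \\big)^{-1} \\cdot \\vec k(x)$ may help when reading the RMSE values printed by the evaluation. This is a sketch that only concerns the kernel-space prediction and ignores the greedy back-mapping step. The unit vector $\\vec e_j$ is notation introduced here, and the second case assumes that $K$ is invertible.\n",
    "\n",
    "If the input tree $x$ is far from all training trees relative to the bandwidth $\\psi$, all kernel values vanish and the model predicts no change:\n",
    "\n",
    "$$\\vec k(x) \\to \\vec 0 \\quad \\Rightarrow \\quad \\vec \\gamma \\to \\vec 0 \\quad \\Rightarrow \\quad \\phi(y) \\to \\phi(x)$$\n",
    "\n",
    "This is the behaviour that the baseline RMSE measures, because the baseline compares the current tree directly to its actual successor.\n",
    "\n",
    "If the input tree coincides with a training tree, $x = x_j$, then $\\vec k(x)$ is the $j$-th column of $K$, and for vanishing regularization the model reproduces the stored successor:\n",
    "\n",
    "$$x = x_j, \\ \\sigma \\to 0 \\quad \\Rightarrow \\quad \\vec \\gamma = \\big( K + \\sigma^2 \\cdot I \\big)^{-1} \\cdot K \\cdot \\vec e_j \\to \\vec e_j \\quad \\Rightarrow \\quad \\phi(y) \\to \\phi(x) + \\phi(y_j) - \\phi(x_j) = \\phi(y_j)$$"
   ]
  },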
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      " --- task boolean --- \n",
      "\n",
      "repeat 1 of 5\n",
      "took 0.137753 seconds to train\n",
      "took 0.470501 seconds to predict\n",
      "RMSE: 2.41091 versus baseline RMSE 1.45774\n",
      "repeat 2 of 5\n",
      "took 0.129635 seconds to train\n",
      "took 0.469569 seconds to predict\n",
      "RMSE: 2.43812 versus baseline RMSE 1.61589\n",
      "repeat 3 of 5\n",
      "took 0.135248 seconds to train\n",
      "took 0.419243 seconds to predict\n",
      "RMSE: 2.46306 versus baseline RMSE 1.52753\n",
      "repeat 4 of 5\n",
      "took 0.130834 seconds to train\n",
      "took 0.215008 seconds to predict\n",
      "RMSE: 2 versus baseline RMSE 1.7097\n",
      "repeat 5 of 5\n",
      "took 0.131168 seconds to train\n",
      "took 0.382734 seconds to predict\n",
      "RMSE: 2.65684 versus baseline RMSE 1.59041\n",
      "\n",
      "\n",
      " --- task peano --- \n",
      "\n",
      "repeat 1 of 5\n",
      "took 3.70749 seconds to train\n",
      "took 58.3135 seconds to predict\n",
      "RMSE: 5.38069 versus baseline RMSE 3\n",
      "repeat 2 of 5\n",
      "took 4.74174 seconds to train\n",
      "took 65.4441 seconds to predict\n",
      "RMSE: 3.96863 versus baseline RMSE 2.79322\n",
      "repeat 3 of 5\n",
      "took 3.83915 seconds to train\n",
      "took 61.2994 seconds to predict\n",
      "RMSE: 4.89415 versus baseline RMSE 3.44944\n",
      "repeat 4 of 5\n",
      "took 4.83539 seconds to train\n",
      "took 60.4223 seconds to predict\n",
      "RMSE: 3.52332 versus baseline RMSE 2.77302\n",
      "repeat 5 of 5\n",
      "took 5.58519 seconds to train\n",
      "took 88.7766 seconds to predict\n",
      "RMSE: 4.94673 versus baseline RMSE 3.19398\n"
     ]
    }
   ],
   "source": [
    "# evaluate kernel tree time series prediction on both tasks in five repeats\n",
    "import json\n",
    "import os\n",
    "import random\n",
    "import time\n",
    "import numpy as np\n",
    "import edist.tree_utils as tu\n",
    "import edist.tree_edits as te\n",
    "import edist.ted as ted\n",
    "import boolean_formulae\n",
    "import peano_addition\n",
    "from kernel_time_series_prediction import KernelTreePredictor\n",
    "\n",
    "R = 5\n",
    "# the number of time series we generate for training\n",
    "M = 100\n",
    "# the number of time series for testing\n",
    "N_test = 10\n",
    "\n",
    "# model hyper parameters\n",
    "psi   = None\n",
    "sigma = 1E-3\n",
    "\n",
    "tasks = ['boolean', 'peano']\n",
    "\n",
    "errors = np.zeros((len(tasks), R))\n",
    "baseline_errors = np.zeros((len(tasks), R))\n",
    "times  = np.zeros((len(tasks), R))\n",
    "prediction_times  = np.zeros((len(tasks), R))\n",
    "\n",
    "# iterate over all tasks\n",
    "for task_idx in range(len(tasks)):\n",
    "    task = tasks[task_idx]\n",
    "    print('\\n\\n --- task %s --- \\n' % task)\n",
    "    if task == 'boolean':\n",
    "        sampling_fun = boolean_formulae.generate_time_series\n",
    "    else:\n",
    "        sampling_fun = peano_addition.generate_time_series\n",
    "\n",
    "    # do repeats\n",
    "    for r in range(R):\n",
    "        print('repeat %d of %d' % (r+1, R))\n",
    "        # sample random training data\n",
    "        training_data = []\n",
    "        for i in range(M):\n",
    "            training_data.append(sampling_fun())\n",
    "        # fit model\n",
    "        start = time.time()\n",
    "        model = KernelTreePredictor(psi = psi, sigma = sigma)\n",
    "        model.fit(training_data)\n",
    "        times[task_idx, r] = time.time() - start\n",
    "        print('took %g seconds to train' % times[task_idx, r])\n",
    "        # evaluate it on the test data\n",
    "        rmse = 0.\n",
    "        baseline_rmse = 0.\n",
    "        m = 0\n",
    "        for i in range(N_test):\n",
    "            # sample test time series\n",
    "            time_series = sampling_fun()\n",
    "            for t in range(len(time_series)-1):\n",
    "                nodes, adj = time_series[t][0], time_series[t][1]\n",
    "                # predict next tree\n",
    "                start = time.time()\n",
    "                _, nodes_pred, adj_pred = model.predict(nodes, adj)\n",
    "                prediction_times[task_idx, r] += time.time() - start\n",
    "                # compare to actual next tree\n",
    "                nodes_act, adj_act = time_series[t+1][0], time_series[t+1][1]\n",
    "                d = ted.ted(nodes, adj, nodes_act, adj_act)\n",
    "                baseline_rmse += d * d\n",
    "                if nodes_pred != nodes_act or adj_pred != adj_act:\n",
    "                    d = ted.ted(nodes_pred, adj_pred, nodes_act, adj_act)\n",
    "                    rmse += d * d\n",
    "            m += len(time_series)\n",
    "        rmse = np.sqrt(rmse / m)\n",
    "        baseline_rmse = np.sqrt(baseline_rmse / m)\n",
    "        prediction_times[task_idx, r] /= m\n",
    "        print('took %g seconds to predict' % prediction_times[task_idx, r])\n",
    "        errors[task_idx, r] = rmse\n",
    "        baseline_errors[task_idx, r] = baseline_rmse\n",
    "        print('RMSE: %g versus baseline RMSE %g' % (rmse, baseline_rmse))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
