{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# VIME Tutorial\n",
    "\n",
    "### VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain\n",
    "\n",
    "- Paper: Jinsung Yoon, Yao Zhang, James Jordon, Mihaela van der Schaar, \n",
    "  \"VIME: Extending the Success of Self- and Semi-supervised Learning to Tabular Domain,\" \n",
    "  Neural Information Processing Systems (NeurIPS), 2020.\n",
    "\n",
    "- Paper link: TBD\n",
    "\n",
    "- Last updated Date: October 11th 2020\n",
    "\n",
    "- Code author: Jinsung Yoon (jsyoon0823@gmail.com)\n",
    "\n",
    "This notebook describes the user-guide of self- and semi-supervised learning for tabular domain using MNIST database."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prerequisite\n",
    "Clone https://github.com/jsyoon0823/VIME.git to the current directory."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Necessary packages and functions call\n",
    "\n",
    "- data_loader: MNIST dataset loading and preprocessing\n",
    "- supervised_models: supervised learning models (Logistic regression, XGBoost, and Multi-layer Perceptron)\n",
    "\n",
    "- vime_self: Self-supervised learning part of VIME framework\n",
    "- vime_semi: Semi-supervised learning part of VIME framework\n",
    "- vime_utils: Some utility functions for VIME framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Using TensorFlow backend.\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "import os\n",
    "import warnings\n",
    "warnings.filterwarnings(\"ignore\")\n",
    "  \n",
    "from data_loader import load_mnist_data\n",
    "from supervised_models import logit, xgb_model, mlp\n",
    "\n",
    "from vime_self import vime_self\n",
    "from vime_semi import vime_semi\n",
    "from vime_utils import perf_metric"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Set the parameters and define output\n",
    "\n",
    "-   label_no: Number of labeled data to be used\n",
    "-   model_sets: supervised model set (mlp, logit, or xgboost)\n",
    "-   p_m: corruption probability for self-supervised learning\n",
    "-   alpha: hyper-parameter to control the weights of feature and mask losses\n",
    "-   K: number of augmented samples\n",
    "-   beta: hyperparameter to control supervised and unsupervised loss\n",
    "-   label_data_rate: ratio of labeled data\n",
    "-   metric: prediction performance metric (either acc or auc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Experimental parameters\n",
    "label_no = 1000  \n",
    "model_sets = ['logit','xgboost','mlp']\n",
    "  \n",
    "# Hyper-parameters\n",
    "p_m = 0.3\n",
    "alpha = 2.0\n",
    "K = 3\n",
    "beta = 1.0\n",
    "label_data_rate = 0.1\n",
    "\n",
    "# Metric\n",
    "metric = 'acc'\n",
    "  \n",
    "# Define output\n",
    "results = np.zeros([len(model_sets)+2])  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load data\n",
    "\n",
    "Load original MNIST dataset and preprocess the loaded data.\n",
    "- Only select the subset of data as the labeled data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load data\n",
    "x_train, y_train, x_unlab, x_test, y_test = load_mnist_data(label_data_rate)\n",
    "    \n",
    "# Use subset of labeled data\n",
    "x_train = x_train[:label_no, :]\n",
    "y_train = y_train[:label_no, :]  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train supervised models\n",
    "\n",
    "- Train 3 supervised learning models (Logistic regression, XGBoost, MLP)\n",
    "- Save the performances of each supervised model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "If using Keras pass *_constraint arguments to layers.\n",
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:422: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.\n",
      "\n",
      "Restoring model weights from the end of the best epoch\n",
      "Epoch 00071: early stopping\n",
      "Supervised Performance, Model Name: logit, Performance: 0.8721\n",
      "Supervised Performance, Model Name: xgboost, Performance: 0.8763\n",
      "Supervised Performance, Model Name: mlp, Performance: 0.8935\n"
     ]
    }
   ],
   "source": [
    "# Logistic regression\n",
    "y_test_hat = logit(x_train, y_train, x_test)\n",
    "results[0] = perf_metric(metric, y_test, y_test_hat) \n",
    "\n",
    "# XGBoost\n",
    "y_test_hat = xgb_model(x_train, y_train, x_test)    \n",
    "results[1] = perf_metric(metric, y_test, y_test_hat)   \n",
    "\n",
    "# MLP\n",
    "mlp_parameters = dict()\n",
    "mlp_parameters['hidden_dim'] = 100\n",
    "mlp_parameters['epochs'] = 100\n",
    "mlp_parameters['activation'] = 'relu'\n",
    "mlp_parameters['batch_size'] = 100\n",
    "      \n",
    "y_test_hat = mlp(x_train, y_train, x_test, mlp_parameters)\n",
    "results[2] = perf_metric(metric, y_test, y_test_hat)\n",
    "\n",
    "# Report performance\n",
    "for m_it in range(len(model_sets)):  \n",
    "    \n",
    "  model_name = model_sets[m_it]  \n",
    "    \n",
    "  print('Supervised Performance, Model Name: ' + model_name + \n",
    "        ', Performance: ' + str(results[m_it]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train & Test VIME-Self\n",
    "Train self-supervised part of VIME framework only\n",
    "- Check the performance of self-supervised part of VIME framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/tensorflow_core/python/ops/nn_impl.py:183: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Use tf.where in 2.0, which has the same broadcast rule as np.where\n",
      "Epoch 1/10\n",
      "54000/54000 [==============================] - 4s 68us/step - loss: 0.2789 - mask_loss: 0.2266 - feature_loss: 0.0261\n",
      "Epoch 2/10\n",
      "54000/54000 [==============================] - 4s 66us/step - loss: 0.2440 - mask_loss: 0.2167 - feature_loss: 0.0137\n",
      "Epoch 3/10\n",
      "54000/54000 [==============================] - 4s 66us/step - loss: 0.2360 - mask_loss: 0.2126 - feature_loss: 0.0117\n",
      "Epoch 4/10\n",
      "54000/54000 [==============================] - 4s 67us/step - loss: 0.2298 - mask_loss: 0.2082 - feature_loss: 0.0108\n",
      "Epoch 5/10\n",
      "54000/54000 [==============================] - 4s 67us/step - loss: 0.2240 - mask_loss: 0.2033 - feature_loss: 0.0103\n",
      "Epoch 6/10\n",
      "54000/54000 [==============================] - 4s 67us/step - loss: 0.2183 - mask_loss: 0.1983 - feature_loss: 0.0100\n",
      "Epoch 7/10\n",
      "54000/54000 [==============================] - 4s 68us/step - loss: 0.2131 - mask_loss: 0.1935 - feature_loss: 0.0098\n",
      "Epoch 8/10\n",
      "54000/54000 [==============================] - 4s 68us/step - loss: 0.2085 - mask_loss: 0.1892 - feature_loss: 0.0096\n",
      "Epoch 9/10\n",
      "54000/54000 [==============================] - 4s 65us/step - loss: 0.2043 - mask_loss: 0.1853 - feature_loss: 0.0095\n",
      "Epoch 10/10\n",
      "54000/54000 [==============================] - 4s 65us/step - loss: 0.2006 - mask_loss: 0.1818 - feature_loss: 0.0094\n",
      "Restoring model weights from the end of the best epoch\n",
      "Epoch 00058: early stopping\n",
      "VIME-Self Performance: 0.9037\n"
     ]
    }
   ],
   "source": [
    "# Train VIME-Self\n",
    "vime_self_parameters = dict()\n",
    "vime_self_parameters['batch_size'] = 128\n",
    "vime_self_parameters['epochs'] = 10\n",
    "vime_self_encoder = vime_self(x_unlab, p_m, alpha, vime_self_parameters)\n",
    "  \n",
    "# Save encoder\n",
    "if not os.path.exists('save_model'):\n",
    "  os.makedirs('save_model')\n",
    "\n",
    "file_name = './save_model/encoder_model.h5'\n",
    "  \n",
    "vime_self_encoder.save(file_name)  \n",
    "        \n",
    "# Test VIME-Self\n",
    "x_train_hat = vime_self_encoder.predict(x_train)\n",
    "x_test_hat = vime_self_encoder.predict(x_test)\n",
    "      \n",
    "y_test_hat = mlp(x_train_hat, y_train, x_test_hat, mlp_parameters)\n",
    "results[3] = perf_metric(metric, y_test, y_test_hat)\n",
    "    \n",
    "print('VIME-Self Performance: ' + str(results[3]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train & Test VIME\n",
    "\n",
    "Train semi-supervised part of VIME framework on top of trained self-supervised encoder\n",
    "- Check the performance of entire part of VIME framework."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:66: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:83: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:83: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/tensorflow_core/contrib/layers/python/layers/layers.py:1866: Layer.apply (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "Please use `layer.__call__` method instead.\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:105: The name tf.losses.softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.softmax_cross_entropy instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:110: The name tf.trainable_variables is deprecated. Please use tf.compat.v1.trainable_variables instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:113: The name tf.train.AdamOptimizer is deprecated. Please use tf.compat.v1.train.AdamOptimizer instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:431: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/anaconda3/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:438: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:124: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:125: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.\n",
      "\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:129: The name tf.train.Saver is deprecated. Please use tf.compat.v1.train.Saver instead.\n",
      "\n",
      "Iteration: 0/1000, Current loss: 2.2812\n",
      "Iteration: 100/1000, Current loss: 0.4304\n",
      "Iteration: 200/1000, Current loss: 0.3922\n",
      "Iteration: 300/1000, Current loss: 0.3689\n",
      "Iteration: 400/1000, Current loss: 0.3402\n",
      "Iteration: 500/1000, Current loss: 0.3717\n",
      "WARNING:tensorflow:From /home/vdslab/Documents/Jinsung/2020_Research/ICML/SSL/Clean_Code/VIME_GitHub_Final/VIME/vime_semi.py:188: The name tf.train.import_meta_graph is deprecated. Please use tf.compat.v1.train.import_meta_graph instead.\n",
      "\n",
      "INFO:tensorflow:Restoring parameters from ./save_model/class_model.ckpt\n",
      "VIME Performance: 0.9244\n"
     ]
    }
   ],
   "source": [
    "# Train VIME-Semi\n",
    "vime_semi_parameters = dict()\n",
    "vime_semi_parameters['hidden_dim'] = 100\n",
    "vime_semi_parameters['batch_size'] = 128\n",
    "vime_semi_parameters['iterations'] = 1000\n",
    "y_test_hat = vime_semi(x_train, y_train, x_unlab, x_test, \n",
    "                       vime_semi_parameters, p_m, K, beta, file_name)\n",
    "\n",
    "# Test VIME\n",
    "results[4] = perf_metric(metric, y_test, y_test_hat)\n",
    "  \n",
    "print('VIME Performance: '+ str(results[4]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Report Prediction Performances\n",
    "\n",
    "- 3 Supervised learning models\n",
    "- VIME with self-supervised part only\n",
    "- Entire VIME framework"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Supervised Performance, Model Name: logit, Performance: 0.8721\n",
      "Supervised Performance, Model Name: xgboost, Performance: 0.8763\n",
      "Supervised Performance, Model Name: mlp, Performance: 0.8935\n",
      "VIME-Self Performance: 0.9037\n",
      "VIME Performance: 0.9244\n"
     ]
    }
   ],
   "source": [
    "for m_it in range(len(model_sets)):  \n",
    "    \n",
    "  model_name = model_sets[m_it]  \n",
    "    \n",
    "  print('Supervised Performance, Model Name: ' + model_name + \n",
    "        ', Performance: ' + str(results[m_it]))\n",
    "    \n",
    "print('VIME-Self Performance: ' + str(results[m_it+1]))\n",
    "  \n",
    "print('VIME Performance: '+ str(results[m_it+2]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
