{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# Training notebook\n",
    "---\n",
    "\n",
    "Notebook showing an example usage of the provided code to collect expert/prior data and replicate the experiments from the paper."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "## Collect expert/prior data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from train_expert import train_expert\n",
    "from collect_expert_data import collect_expert_data\n",
    "from collect_prior_data import collect_prior_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "**Train an 'expert' agent and collect expert trajectories in the source environment:**\n",
    "\n",
    "Tested source environments:\n",
    "\n",
    "1) *Inverted Pendulum* realm:\n",
    "- *InvertedPendulum-v2*\n",
    "- *InvertedDoublePendulum-v2*\n",
    "\n",
    "2) *Reacher* realm:\n",
    "- *ReacherEasy-v2*\n",
    "- *ThreeReacherEasy-v2*\n",
    "\n",
    "3) *Hopper* realm:\n",
    "- *Hopper-v2*\n",
    "\n",
    "4) *Half-Cheetah* realm:\n",
    "- *HalfCheetah-v2*\n",
    "\n",
    "5) *7DOF-Pusher* realm:\n",
    "- *PusherHumanSim-v2*\n",
    "\n",
    "6) *7DOF-Striker* realm:\n",
    "- *StrikerHumanSim-v2*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# define source environment\n",
    "env_name = 'StrikerHumanSim-v2'\n",
    "\n",
    "# Train expert agent\n",
    "expert_agent = train_expert(env_name=env_name)\n",
    "\n",
    "# Collect expert trajectories\n",
    "collect_expert_data(agent=expert_agent,\n",
    "                    env_name=env_name,\n",
    "                    max_timesteps=10000,\n",
    "                    expert_samples_location='expert_data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Collect prior trajectories for the realms:**\n",
    "\n",
    "Tested realms:\n",
    "\n",
    "1) *Inverted Pendulum* realm: *Inverted Pendulum/InvertedPendulum*\n",
    "\n",
    "2) *Reacher* realm: *Reacher*\n",
    "\n",
    "3) *Hopper* realm: *Hopper*\n",
    "\n",
    "4) *Half-Cheetah* realm: *Half-Cheetah/HalfCheetah*\n",
    "\n",
    "5) *7DOF-Pusher* realm: *7DOF-Pusher/Pusher*\n",
    "\n",
    "6) *7DOF-Striker* realm: *7DOF-Striker/Striker*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    },
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# e.g. define realm\n",
    "env_realm = 'Striker'\n",
    "\n",
    "# Collect prior data\n",
    "collect_prior_data(realm_name=env_realm,\n",
    "                   max_timesteps=10000,\n",
    "                   prior_samples_location='prior_data')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "source": [
    "## Perform *Observational* Imitation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### *DisentanGAIL* models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "from run_experiment import run_experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "source": [
    "**Replicating the experiments**\n",
    "\n",
    "We provide a function that simply allows to collect data for different variations of the *DisentanGAIL* algorithm.\n",
    "\n",
    "We divide the parameters of this function in three dictionaries, below we provide a brief description on the most relevant parameters to reproduce the experiments with *DisentanGAIL* by using the hyper-parameters given in the supplementary material (*Appendix B*). The parameters that are not specified below can be kept constant for the experiments in all *source/target* environments. \n",
    "\n",
    "Please, refer to our code (mainly *competing_gail.py*) for further details.\n",
    "\n",
    "1) **exp_params**: defines the general parameters of the algorithm:\n",
    "\n",
    ">Specify *exp_name* as the location where to log the results\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *exp_num* as the experiment number\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *env_name* as the *source* environment name:\n",
    ">\n",
    ">---\n",
    "\n",
    ">For the Inverted Pendulum realm:\n",
    ">>*InvertedPendulum-v2*\n",
    ">>\n",
    ">>*InvertedDoublePendulum-v2*\n",
    "\n",
    "> For the *Reacher* realm:\n",
    ">>*ReacherEasy-v2*\n",
    ">>\n",
    ">>*ThreeReacherEasy-v2*\n",
    "\n",
    ">For the *Half-Cheetah* realm:\n",
    ">>*HalfCheetah-v2*\n",
    "\n",
    ">For the *7DOF-Pusher* realm:\n",
    ">>*PusherHumanSim-v2*\n",
    "\n",
    ">---\n",
    "\n",
    ">Specify *env_type* as the domain difference in the *target* environment:\n",
    ">\n",
    ">---\n",
    "\n",
    ">For the Inverted Pendulum realm:\n",
    ">>*InvertedPendulum-v2*: *expert/colored/to_two/to_colored_two*\n",
    ">>\n",
    ">>*InvertedDoublePendulum-v2*: *expert/colored/to_one/to_colored_one*\n",
    "\n",
    "> For the *Reacher* realm:\n",
    ">>*ReacherEasy-v2*: *expert/tilted/to_three/to_tilted_three*\n",
    ">>\n",
    ">>*ThreeReacherEasy-v2*: *expert/tilted/to_two/to_tilted_two*\n",
    "\n",
    ">For the *Hopper* realm:\n",
    ">>*Hopper-v2*: *expert/flexible*\n",
    "\n",
    ">For the *Half-Cheetah* realm:\n",
    ">>*HalfCheetah-v2*: *expert/to_locked_feet*\n",
    "\n",
    ">For the *7DOF-Pusher* realm:\n",
    ">>*PusherHumanSim-v2*: *expert/to_robot*\n",
    "\n",
    ">For the *7DOF-Striker* realm:\n",
    ">>*StrikerHumanSim-v2*: *expert/to_robot*\n",
    "\n",
    ">---\n",
    "\n",
    ">Specify *epochs* as the maximum number of epochs to run the experiment.\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *episode_limit* as the task horizon.\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *return_threshold* as an optional early termination condition to stop learning after reaching the wanted performance.\n",
    ">\n",
    ">---\n",
    "\n",
    "2) **learner_params**: defines the parameters of the 'observer' agent's policy:\n",
    "\n",
    ">Specify *l_buffer_size* as the maximum dimension set of visual trajectories collected by the agent\n",
    ">\n",
    ">---\n",
    "\n",
    "3) **discriminator_params**: defines the parameters of the discriminator:\n",
    "\n",
    ">Specify *d_pre_filters* as a list with the filters within each layer of the preprocessor (followed by a Tanh nonlinearity and 2x2 Max-Pooling)\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *d_hidden_units* as a list with the number of units within each hidden fully-connected layer of the invariant discriminator (followed by a ReLU/Tanh nonlinearity)\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *d_mi_hidden_units* as a list with the number of units within each hidden fully-connected layer of the statistics network (followed by a Tanh nonlinearity)\n",
    ">\n",
    ">---\n",
    "\n",
    ">Specify *n_expert_demos/n_expert_prior_demos/n_agent_prior_demos* as the number of *expert demonstrations/prior expert data/prior agent data* to utilize for learning.\n",
    ">\n",
    ">---"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Define parameters\n",
    "\n",
    "exp_params = {\n",
    "    'exp_name': 'InvertedPendulum_to_colored/DisentanGAIL',\n",
    "    'expert_samples_location': 'expert_data',\n",
    "    'prior_samples_location': 'prior_data',\n",
    "    'env_name': 'InvertedPendulum-v2',\n",
    "    'env_type': 'colored',\n",
    "    'exp_num': -1,\n",
    "    'epochs': 20,\n",
    "    'test_runs_per_epoch': 5,\n",
    "    'steps_per_epoch': 1000,\n",
    "    'init_random_samples': 5000,\n",
    "    'training_starts': 512,\n",
    "    'episode_limit': 50,\n",
    "    'return_threshold': 50.,\n",
    "    'plot_sanity_check_samples': True,\n",
    "}\n",
    "learner_params = {\n",
    "    'l_type': 'SAC',\n",
    "    'l_buffer_size': 10000,\n",
    "    'l_exploration_noise': 0.2,\n",
    "    'l_learning_rate': 1e-3,\n",
    "    'l_batch_size': 256,\n",
    "    'l_updates_per_step': 1,\n",
    "    'l_act_delay': 1,\n",
    "    'l_gamma': 0.99,\n",
    "    'l_polyak': 0.995,\n",
    "    'l_train_actor_noise': 0.1,\n",
    "    'l_entropy_coefficient': 0.2,\n",
    "    'l_tune_entropy_coefficient': True,\n",
    "    'l_target_entropy': None,\n",
    "    'l_clip_actor_gradients': False,\n",
    "    'l_weighted_buffer': False,\n",
    "    'l_weighted_buffer_baseline': False,\n",
    "    'l_expert_proportion': 0.5,\n",
    "    'l_expert_percentile': 0.9,\n",
    "}\n",
    "\n",
    "discriminator_params = {\n",
    "    'd_type': 'latent',\n",
    "    'd_rew_noise': False,\n",
    "    'd_learning_rate': 1e-3,\n",
    "    'd_mi_learning_rate': 1e-3,\n",
    "    'd_updates_per_step': 1,\n",
    "    'd_mi_updates_per_step': 1,\n",
    "    'd_e_batch_size': 128,\n",
    "    'd_l_batch_size': 128,\n",
    "    'd_stability_constant':  1e-7,\n",
    "    'd_sn_discriminator': True,\n",
    "    'd_pre_filters': [16, 16, 1],\n",
    "    'd_hidden_units': [32, 32],\n",
    "    'd_mi_hidden_units': [32, 32],\n",
    "    'd_mi_constant': 0.5,\n",
    "    'd_adaptive_mi': True,\n",
    "    'd_double_mi': True,\n",
    "    'd_use_min_double_mi': True,\n",
    "    'd_max_mi': 0.99,\n",
    "    'd_min_mi': 0.99/2,\n",
    "    'd_mi_lagrangian_lr': 1e-3,\n",
    "    'd_max_mi_constant': 5.0,\n",
    "    'd_min_mi_constant': 1e-4,\n",
    "    'd_unbiased_mi': True,\n",
    "    'd_unbiased_mi_decay': .99,\n",
    "    'd_prior_mi_constant': 1.0,\n",
    "    'd_min_mi_prior_constant': 1e-3,\n",
    "    'd_max_mi_prior': 0.001,\n",
    "    'd_clip_mi_predictions': True,\n",
    "    'd_pre_scale_stddev': 0.5,\n",
    "    'd_negative_priors': True,\n",
    "    'n_expert_demos': 10000,\n",
    "    'n_expert_prior_demos': 10000,\n",
    "    'n_agent_prior_demos': 10000,\n",
    "}\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true,
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# Run 5 repetitions of the experiment\n",
    "\n",
    "for i in range(5):\n",
    "    exp_params['exp_num'] = i\n",
    "    gail, sampler = run_experiment(exp_params, learner_params, discriminator_params)\n",
    "    \n",
    "    # uncomment below to visualize the learnt behaviour at the end of each experiment\n",
    "    \n",
    "    # sampler.sample_test_trajectories(gail._agent, 0.0, 1, True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "### *Domain confusion loss* models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "from run_experiment_dc_loss import run_experiment"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "**Replicating the *domain confusion loss* results**\n",
    "\n",
    "We provide a function that simply allows to collect data with different algorithms making use of our implementation of the *domain confusion loss*.\n",
    "\n",
    "The parameters for this function are similar to the *DisentanGAIL* parameters, described above. Please, refer to our code (mainly *dc_loss.py*) for further details.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Define parameters\n",
    "\n",
    "exp_params = {\n",
    "    'exp_name': 'InvertedPendulum_to_colored/dc_loss',\n",
    "    'expert_samples_location': 'expert_data',\n",
    "    'prior_samples_location': 'prior_data',\n",
    "    'env_name': 'InvertedPendulum-v2',\n",
    "    'env_type': 'colored',\n",
    "    'exp_num': -1,\n",
    "    'epochs': 20,\n",
    "    'test_runs_per_epoch': 5,\n",
    "    'steps_per_epoch': 1000,\n",
    "    'init_random_samples': 5000,\n",
    "    'training_starts': 512,\n",
    "    'episode_limit': 50,\n",
    "    'return_threshold': 50.,\n",
    "    'plot_sanity_check_samples': True,\n",
    "}\n",
    "learner_params = {\n",
    "    'l_type': 'SAC',\n",
    "    'l_buffer_size': 10000,\n",
    "    'l_learning_rate': 1e-3,\n",
    "    'l_batch_size': 256,\n",
    "    'l_updates_per_step': 1,\n",
    "    'l_act_delay': 1,\n",
    "    'l_gamma': 0.99,\n",
    "    'l_polyak': 0.995,\n",
    "    'l_train_actor_noise': 0.1,\n",
    "    'l_entropy_coefficient': 0.2,\n",
    "    'l_tune_entropy_coefficient': True,\n",
    "    'l_target_entropy': None,\n",
    "    'l_clip_actor_gradients': False,\n",
    "    'l_weighted_buffer': False,\n",
    "    'l_weighted_buffer_baseline': False,\n",
    "    'l_expert_proportion': 0.5,\n",
    "    'l_expert_percentile': 0.9,\n",
    "}\n",
    "\n",
    "discriminator_params = {\n",
    "    'd_type': 'latent',\n",
    "    'd_domain_constant': 0.25,\n",
    "    'd_rew_noise': False,\n",
    "    'd_learning_rate': 1e-3,\n",
    "    'd_updates_per_step': 1,\n",
    "    'd_e_batch_size': 128,\n",
    "    'd_l_batch_size': 128,\n",
    "    'd_stability_constant':  1e-7,\n",
    "    'd_sn_discriminator': True,\n",
    "    'd_pre_filters': [16, 16, 1],\n",
    "    'd_hidden_units': [100, 100],\n",
    "    'd_use_prior_data': True,\n",
    "    'd_pre_scale_stddev': 0.5,\n",
    "    'n_expert_demos': 10000,\n",
    "    'n_expert_prior_demos': 10000,\n",
    "    'n_agent_prior_demos': 10000,\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "# Run 5 repetitions of the experiment\n",
    "\n",
    "for i in range(5):\n",
    "    exp_params['exp_num'] = i\n",
    "    gail, sampler = run_experiment(exp_params, learner_params, discriminator_params)\n",
    "    \n",
    "    # uncomment below to visualize the learnt behaviour at the end of each experiment\n",
    "    \n",
    "    # sampler.sample_test_trajectories(gail._agent, 0.0, 1, True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
