{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "# Tutorial 05 - Settings Optimizing\n",
    "\n",
    "Similar to other learning-based approaches, model hyperparameters play an important role in RL. Likewise, environment configurations such as observation items or reward weights affect the outcome of RL substantially. It is therefore critical to enable a finetuning process for these different settings. We offer now three separate options to choose from:\n",
    "* optimize observation configurations\n",
    "* optimize reward configurations\n",
    "* optimize model hyperparameters\n",
    "\n",
    "In CommonRoad-RL, these are achieved with the [Optuna](https://optuna.org) package. Essentially, with Optuna's interfaces, an optimization process involves several rounds of learning processes, or \"trials\" as called in the package, each being triggered with a set of sampled configurations/hyperparameters and behaving exactly the same as a vanilla learning process. Finally, with several rounds of learning conducted, the best performing set of configurations/hyperparameters will be reported. Note that there exist different criteria for configurations and hyperparameters optimizing. A more detailed description can be found in `commonroad_rl/README.md`.     \n",
    "\n",
    "We show in this tutorial an example to optimize reward configurations. The interfaces to optimize observation configurations and model hyperparamters are similar."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 0. Preparation\n",
    "\n",
    "Please make sure the training and testing data are prepared, otherwise see **Tutorial 01 - Data Preprocessing**. It is highly recommended that both **Tutorial 02 - Vanilla Learning** and **Tutorial 03 - Continual Learning** are completed first. Also, check the followings:\n",
    "* current path is at the project root `commonroad-rl`, i.e. two upper layers to the `tutorials` folder\n",
    "* interactive python kernel is triggered from the correct environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "# Check current path\n",
    "%cd ../..\n",
    "%pwd\n",
    "\n",
    "# Check interactive python kernel\n",
    "import sys\n",
    "sys.executable"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 1. Load RL environment and model settings\n",
    "\n",
    "Similar to a vanilla learning process, we have to specify the environment configurations and model hyperparameters beforehands. However, a major difference for configurations optimization is that besides assigning direct values, we also prepare a set of sampling settings so that later the optimizer knows what sampling method and what sampling range/candidates are used for each item. Please see `commonroad_rl/config.yaml` for details.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import yaml\n",
    "import copy\n",
    " \n",
    "# Read in environment configurations \n",
    "env_configs = {}\n",
    "with open(\"commonroad_rl/gym_commonroad/configs.yaml\", \"r\") as config_file:\n",
    "    env_configs = yaml.safe_load(config_file)[\"env_configs\"]\n",
    "\n",
    "# Save settings for later use\n",
    "log_path = \"commonroad_rl/tutorials/logs/\"\n",
    "os.makedirs(log_path, exist_ok=True)\n",
    "\n",
    "with open(os.path.join(log_path, \"environment_configurations.yml\"), \"w\") as config_file:\n",
    "    yaml.dump(env_configs, config_file)\n",
    "\n",
    "# Read in model hyperparameters\n",
    "hyperparams = {}\n",
    "with open(\"commonroad_rl/hyperparams/ppo2.yml\", \"r\") as hyperparam_file:\n",
    "    hyperparams = yaml.safe_load(hyperparam_file)[\"commonroad-v0\"]\n",
    "    \n",
    "# Save settings for later use\n",
    "with open(os.path.join(log_path, \"model_hyperparameters.yml\"), \"w\") as hyperparam_file:\n",
    "    yaml.dump(hyperparams, hyperparam_file)\n",
    "\n",
    "# Remove `normalize` as it will be handled explicitly later\n",
    "if \"normalize\" in hyperparams:\n",
    "    del hyperparams[\"normalize\"]\n",
    "    \n",
    "# Read in sampling settings for reward configurations\n",
    "sampling_settings_reward_configs = {}\n",
    "with open(\"commonroad_rl/gym_commonroad/configs.yaml\", \"r\") as config_file:\n",
    "    sampling_settings_reward_configs = yaml.safe_load(config_file)[\"sampling_setting_reward_configs\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 2. Create training and testing environments\n",
    "\n",
    "Likewise, training and testing environments are to be prepared for settings optimizing. However, since we are going to sample and optimize the reward configurations later, we do not create the environments directly at this point, but a callable function and pass it to the optimizer interface from the [Optuna](https://optuna.org) package."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "import gym\n",
    "from stable_baselines.bench import Monitor\n",
    "from stable_baselines.common.vec_env import DummyVecEnv, VecNormalize\n",
    "\n",
    "import commonroad_rl.gym_commonroad\n",
    "\n",
    "# Prepare a low-level environment maker\n",
    "meta_scenario_path = \"commonroad_rl/tutorials/data/highD/pickles/meta_scenario\"\n",
    "training_data_path = \"commonroad_rl/tutorials/data/highD/pickles/problem_train\"\n",
    "testing_data_path = \"commonroad_rl/tutorials/data/highD/pickles/problem_test\"\n",
    "\n",
    "def make_env(**env_kwargs):\n",
    "    def _func():\n",
    "        # Create the environment\n",
    "        env = gym.make(\"commonroad-v1\", \n",
    "                       meta_scenario_path=meta_scenario_path,\n",
    "                       train_reset_config_path= training_data_path,\n",
    "                       test_reset_config_path= testing_data_path,\n",
    "                       **env_kwargs)\n",
    "\n",
    "        # Wrap the environment with a monitor to keep an record of the learning process\n",
    "        info_keywords=tuple([\"is_collision\", \\\n",
    "                             \"is_time_out\", \\\n",
    "                             \"is_off_road\", \\\n",
    "                             \"is_friction_violation\", \\\n",
    "                             \"is_goal_reached\"])\n",
    "        env = Monitor(env, log_path + \"infos\", info_keywords=info_keywords)\n",
    "        return env\n",
    "    return _func\n",
    "\n",
    "# Prepare a callable function for the optimizer\n",
    "def create_env(**env_kwargs):\n",
    "    # Vectorize the environment\n",
    "    env = DummyVecEnv([make_env(**env_kwargs)])\n",
    "\n",
    "    # Normalize observations and rewards as required\n",
    "    if \"test_env\" not in env_kwargs or env_kwargs[\"test_env\"] is False:\n",
    "        # Normalize observations and rewards during training\n",
    "        env = VecNormalize(env, norm_obs=True, norm_reward=True)\n",
    "    else:\n",
    "        # Normalize only observations during testing\n",
    "        env = VecNormalize(env, norm_obs=True, norm_reward=False)\n",
    "\n",
    "    return env\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 3. Create a model\n",
    "\n",
    "In addition, we do not create a model explicitly but prepare a callable function and pass it to the optimizer later."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "from stable_baselines import PPO2\n",
    "\n",
    "def create_model(hyperparams, env_configs):\n",
    "    return PPO2(env=create_env(**env_configs), **hyperparams)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 4. Assemble an evaluation callback with optimization criteria\n",
    "\n",
    "During an optimization process, it is important to assess how good or bad a set of sampled configurations has performed. This is done with an evaluation callback, which will be appended to every learning trial. In `commonroad_rl/utils_run/callbacks.py`, there are specific callback functions defined for reward configurations, observation configurations, and model hyperparameters. In the following, we show how the callback for reward configurations are established for later use."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from stable_baselines.common.callbacks import EvalCallback\n",
    "from stable_baselines.common.vec_env import sync_envs_normalization, VecEnv\n",
    "\n",
    "class RewardConfigsTrialEvalCallback(EvalCallback):\n",
    "    def __init__(\n",
    "        self,\n",
    "        eval_env,\n",
    "        trial,\n",
    "        n_eval_episodes=5,\n",
    "        eval_freq=10000,\n",
    "        log_path=None,\n",
    "        best_model_save_path=None,\n",
    "        deterministic=True,\n",
    "        verbose=1,\n",
    "    ):\n",
    "        super(RewardConfigsTrialEvalCallback, self).__init__(\n",
    "            eval_env=eval_env,\n",
    "            n_eval_episodes=n_eval_episodes,\n",
    "            eval_freq=eval_freq,\n",
    "            deterministic=deterministic,\n",
    "            verbose=verbose,\n",
    "        )\n",
    "        self.trial = trial\n",
    "        self.eval_idx = 0\n",
    "        self.is_pruned = False\n",
    "        self.lowest_mean_cost = np.inf\n",
    "        self.last_mean_cost = np.inf\n",
    "        self.cost = 0.0\n",
    "\n",
    "        # Save best model into `($best_model_save_path)/trial_($trial_number)/best_model.zip`\n",
    "        if best_model_save_path is not None:\n",
    "            self.best_model_save_path = os.path.join(\n",
    "                best_model_save_path, \"trial_\" + str(trial.number), \"best_model\"\n",
    "            )\n",
    "            os.makedirs(self.best_model_save_path, exist_ok=True)\n",
    "        else:\n",
    "            self.best_model_save_path = best_model_save_path\n",
    "\n",
    "        # Log evaluation information into `($log_path)/trial_($trial_number)/evaluations.npz`\n",
    "        self.evaluation_timesteps = []\n",
    "        self.evaluation_costs = []\n",
    "        self.evaluation_lengths = []\n",
    "        if log_path is not None:\n",
    "            self.log_path = os.path.join(\n",
    "                log_path, \"trial_\" + str(trial.number), \"evaluations\"\n",
    "            )\n",
    "            os.makedirs(os.path.dirname(self.log_path), exist_ok=True)\n",
    "        else:\n",
    "            self.log_path = log.path\n",
    "\n",
    "    def _on_step(self):\n",
    "        if self.eval_freq > 0 and self.n_calls % self.eval_freq == 0:\n",
    "            def evaluate_policy_configs(\n",
    "                model,\n",
    "                env,\n",
    "                n_eval_episodes=10,\n",
    "                render=False,\n",
    "                deterministic=True,\n",
    "                callback=None,\n",
    "            ):\n",
    "                \"\"\"\n",
    "                Runs policy for `n_eval_episodes` episodes and returns cost for optimization.\n",
    "                This is made to work only with one env.\n",
    "\n",
    "                :param model: (BaseRLModel) The RL agent you want to evaluate.\n",
    "                :param env: (gym.Env or VecEnv) The gym environment. In the case of a `VecEnv`, this must contain only one environment.\n",
    "                :param n_eval_episodes: (int) Number of episode to evaluate the agent\n",
    "                :param deterministic: (bool) Whether to use deterministic or stochastic actions\n",
    "                :param render: (bool) Whether to render the environment or not\n",
    "                :param callback: (callable) callback function to do additional checks, called after each step.\n",
    "                :return: ([float], [int]) list of episode costs and lengths\n",
    "                \"\"\"\n",
    "                if isinstance(env, VecEnv):\n",
    "                    assert (\n",
    "                        env.num_envs == 1\n",
    "                    ), \"You must pass only one environment when using this function\"\n",
    "\n",
    "                episode_costs = []\n",
    "                episode_lengths = []\n",
    "                for _ in range(n_eval_episodes):\n",
    "                    obs = env.reset()\n",
    "                    done, info, state = False, None, None\n",
    "\n",
    "                    # Record required information\n",
    "                    # Since vectorized environments get reset automatically after each episode,\n",
    "                    # we have to keep a copy of the relevant states here.\n",
    "                    # See https://stable-baselines.readthedocs.io/en/master/guide/vec_envs.html for more details.\n",
    "                    episode_length = 0\n",
    "                    episode_cost = 0.0\n",
    "                    episode_is_time_out = []\n",
    "                    episode_is_collision = []\n",
    "                    episode_is_off_road = []\n",
    "                    episode_is_goal_reached = []\n",
    "                    episode_is_friction_violation = []\n",
    "                    while not done:\n",
    "                        action, state = model.predict(\n",
    "                            obs, state=state, deterministic=deterministic\n",
    "                        )\n",
    "                        obs, reward, done, info = env.step(action)\n",
    "\n",
    "                        episode_length += 1\n",
    "                        episode_is_time_out.append(info[-1][\"is_time_out\"])\n",
    "                        episode_is_collision.append(info[-1][\"is_collision\"])\n",
    "                        episode_is_off_road.append(info[-1][\"is_off_road\"])\n",
    "                        episode_is_goal_reached.append(info[-1][\"is_goal_reached\"])\n",
    "                        episode_is_friction_violation.append(info[-1][\"is_friction_violation\"])\n",
    "\n",
    "                        if callback is not None:\n",
    "                            callback(locals(), globals())\n",
    "                        if render:\n",
    "                            env.render()\n",
    "\n",
    "                    # Calculate cost for optimization from state information\n",
    "                    normalized_episode_length = (\n",
    "                        episode_length / info[-1][\"max_episode_time_steps\"]\n",
    "                    )\n",
    "                    if episode_is_time_out[-1]:\n",
    "                        episode_cost += 10.0 * (1 / normalized_episode_length)\n",
    "                    if episode_is_collision[-1]:\n",
    "                        episode_cost += 10.0 * (1 / normalized_episode_length)\n",
    "                    if episode_is_off_road[-1]:\n",
    "                        episode_cost += 10.0 * (1 / normalized_episode_length)\n",
    "                    if episode_is_friction_violation[-1]:\n",
    "                        episode_cost += 10.0 * (1 / normalized_episode_length)\n",
    "                    if episode_is_goal_reached[-1]:\n",
    "                        episode_cost -= 10.0 * normalized_episode_length\n",
    "\n",
    "                    episode_costs.append(episode_cost)\n",
    "                    episode_lengths.append(episode_length)\n",
    "\n",
    "                return episode_costs, episode_lengths\n",
    "\n",
    "            sync_envs_normalization(self.training_env, self.eval_env)\n",
    "            episode_costs, episode_lengths = evaluate_policy_configs(\n",
    "                self.model,\n",
    "                self.eval_env,\n",
    "                n_eval_episodes=self.n_eval_episodes,\n",
    "                render=self.render,\n",
    "                deterministic=self.deterministic,\n",
    "            )\n",
    "\n",
    "            mean_cost, std_cost = np.mean(episode_costs), np.std(episode_costs)\n",
    "            mean_length, std_length = np.mean(episode_lengths), np.std(episode_lengths)\n",
    "            self.last_mean_cost = mean_cost\n",
    "\n",
    "            if self.verbose > 0:\n",
    "                print(\"Evaluating at learning time step: {}\".format(self.num_timesteps))\n",
    "                print(\"Cost mean: {:.2f}, std: {:.2f}\".format(mean_cost, std_cost))\n",
    "                print(\"Length mean: {:.2f}, std: {:.2f}\".format(mean_length, std_length))\n",
    "\n",
    "            if self.log_path is not None:\n",
    "                self.evaluation_timesteps.append(self.num_timesteps)\n",
    "                self.evaluation_costs.append(episode_costs)\n",
    "                self.evaluation_lengths.append(episode_lengths)\n",
    "                np.savez(\n",
    "                    self.log_path,\n",
    "                    timesteps=self.evaluation_timesteps,\n",
    "                    episode_costs=self.evaluation_costs,\n",
    "                    episode_lengths=self.evaluation_lengths,\n",
    "                )\n",
    "\n",
    "            if mean_cost < self.lowest_mean_cost:\n",
    "                self.lowest_mean_cost = mean_cost\n",
    "                if self.best_model_save_path is not None:\n",
    "                    self.model.save(self.best_model_save_path)\n",
    "                # Trigger callback if needed\n",
    "                if self.callback is not None:\n",
    "                    return self._on_event()\n",
    "\n",
    "            # Report trial results\n",
    "            self.eval_idx += 1\n",
    "            self.cost = self.lowest_mean_cost\n",
    "            self.trial.report(self.cost, self.eval_idx)\n",
    "            # Prune trial if need\n",
    "            if self.trial.should_prune():\n",
    "                self.is_pruned = True\n",
    "                return False\n",
    "        return True"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 5. Formalize the optimization objective and process\n",
    "\n",
    "Also different from a regular learning process, we have to define an objective function which helps sample the set of configurations to be used in the upcoming trial, call the functions for the model, environments and callbacks, and start a trial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "def objective_reward_configs(trial):\n",
    "    # Sample reward configurations according to settings\n",
    "    sampled_reward_configs = {}\n",
    "    for key, value in sampling_settings_reward_configs['reward_configs_hybrid'].items():\n",
    "        method, interval = next(iter(value.items()))\n",
    "        if method == \"categorical\":\n",
    "            sampled_reward_configs[key] = trial.suggest_categorical(key, interval)\n",
    "        elif method == \"uniform\":\n",
    "            sampled_reward_configs[key] = trial.suggest_uniform(key, interval[0], interval[1])\n",
    "        elif method == \"loguniform\":\n",
    "            sampled_reward_configs[key] = trial.suggest_loguniform(key, interval[0], interval[1])\n",
    "        else:\n",
    "            print(\"Sampling method \" + method + \" not supported for \" + key)\n",
    "    \n",
    "    # Update environment configurations\n",
    "    env_configs.update({'reward_configs_hybrid' : sampled_reward_configs})\n",
    "    \n",
    "    # Save data for later inspection\n",
    "    tmp_path = os.path.join(log_path, \"trial_\" + str(trial.number))\n",
    "    os.makedirs(tmp_path, exist_ok=True)\n",
    "    with open(os.path.join(tmp_path, \"environment_configurations.yml\"), \"w\") as f:\n",
    "        yaml.dump(env_configs, f)\n",
    "    \n",
    "    model = create_model(hyperparams, env_configs)\n",
    "    testing_env = create_env(test_env=True, **env_configs)\n",
    "\n",
    "    reward_configs_eval_callback = RewardConfigsTrialEvalCallback(testing_env,\n",
    "                                                                  trial,\n",
    "                                                                  n_eval_episodes=3,\n",
    "                                                                  eval_freq=500,\n",
    "                                                                  log_path=log_path,\n",
    "                                                                  best_model_save_path=log_path,\n",
    "                                                                  deterministic=True,\n",
    "                                                                  verbose=1)\n",
    "    # Conduct a learning trial\n",
    "    try:\n",
    "        n_timesteps = 3000\n",
    "        model.learn(n_timesteps, callback=reward_configs_eval_callback)\n",
    "        # Free memory\n",
    "        model.env.close()\n",
    "        testing_env.close()\n",
    "    # Catch NaN from bad random configurations\n",
    "    except AssertionError:\n",
    "        # Free memory\n",
    "        model.env.close()\n",
    "        testing_env.close()\n",
    "        raise optuna.exceptions.TrialPruned()\n",
    "    \n",
    "    # Record trial results\n",
    "    is_pruned = reward_configs_eval_callback.is_pruned\n",
    "    cost = reward_configs_eval_callback.cost\n",
    "    del model.env, testing_env\n",
    "    del model\n",
    "    if is_pruned:\n",
    "        raise optuna.exceptions.TrialPruned()\n",
    "    return cost"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "## 6. Trigger the optimization process and save the best\n",
    "\n",
    "Finally, we are ready to trigger the optimization process. For such, we first create a `study` object from the Optuna package with the designated sampler and pruner, and then call the `optimize` member function to start the overall process, passing in the objective function defined above. Please see the [Optuna examples](https://optuna.org/#code_examples) for details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": [
    "import optuna\n",
    "from optuna.samplers import RandomSampler\n",
    "from optuna.pruners import MedianPruner\n",
    "\n",
    "# Create a study object on reward configurations from Optuna\n",
    "reward_configs_study = optuna.create_study(sampler=RandomSampler(), pruner=MedianPruner())\n",
    "\n",
    "# Start optimizing\n",
    "reward_configs_study.optimize(objective_reward_configs, n_trials=5, n_jobs=1)\n",
    "\n",
    "# Access and record the best performing set of configurations after all trials\n",
    "with open(os.path.join(log_path, \"report_reward_configs_study.yaml\"), \"w\") as f:\n",
    "    yaml.dump(reward_configs_study.best_trial.params, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {}
   },
   "source": [
    "Now in `tutorials/logs`, there should be a resulting best `.yaml` file and several `trial_*` folders recording the information for each of the optimization trials. These are useful for inspection and reuse in other subsequent learnings. For a detailed diretory description, please refer to the `commonroad_rl/README.md` file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "pycharm": {}
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "@webio": {
   "lastCommId": null,
   "lastKernelId": null
  },
  "kernelspec": {
   "display_name": "Python (cr37_release)",
   "language": "python",
   "name": "cr37-release"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
