# Provably Safe Reinforcement Learning: Conceptual Analysis, Survey, and Benchmarking

## Prerequisites

This benchmark is developed for Python 3.8 (Ubuntu 20.04). We assume that [conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) is installed. 

## Installation

```bash
sudo apt-get install libgmp3-dev
```
Clone all submodules:
```bash
git submodule init
git submodule update --recursive
```

```bash
#Note: Not all required packages are available in conda-forge channels.
conda env create -f environment.yml
source activate safe_rl_benchmark
pip install -e .
cd provably_safe_benchmark/external/provably-safe-env
pip install -e .
```

## Run an experiment
Pendulum:
```bash
python provably_safe_benchmark/benchmark/benchmark_pendulum.py --f 0
```
2D Quadrotor:
```bash
python provably_safe_benchmark/benchmark/benchmark_2D_quadrotor.py --f 0
```
Select `--f` between 0 and 64 (inclusive). See `experiment_ids.xlsx` for an explanation of each ID.

The benchmark consists of 65 configurations. Each configuration specifies the learning algorithm, the action space, the provably safe RL method and the learning tuple:
- Algorithm
    - DQN
    - TD3
    - PPO
    - SAC
- ActionSpace
    - Discrete
    - Continuous
- Approach
    - Baseline
    - Sample (replaces unsafe actions with uniformly sampled safe actions)
    - FailSafe (replaces unsafe actions with safe LQR actions)
    - Projection (projects unsafe actions to the closest safe action using CBFs)
    - Masking (agent chooses solely from the safe action set)
- TransitionTuple
    - Naive: (s, a, s′, r)
    - AdaptionPenalty: (s, a, s′, r*)
    - SafeAction: (s, a_phi, s′, r)
    - Both: (s, a, s′, r*) + (s, a_phi, s′, r)
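
The four transition-tuple variants can be illustrated with a minimal sketch. Here `a` is the action proposed by the agent, `a_phi` is taken to be the action executed after the safety method intervened, and `r*` is the reward including an adaption penalty. The function name `build_transitions`, the `penalty` parameter, and the choice to subtract a constant penalty whenever `a != a_phi` are assumptions made for illustration; they are not taken from the benchmark's implementation.

```python
from typing import List, Tuple

# Scalar placeholders for (state, action, next_state, reward).
Transition = Tuple[float, float, float, float]


def build_transitions(
    mode: str,
    s: float,
    a: float,
    a_phi: float,
    s_next: float,
    r: float,
    penalty: float = 1.0,  # hypothetical constant penalty
) -> List[Transition]:
    """Return the transitions stored for one environment step."""
    intervened = a != a_phi
    # Assumption: r* equals r minus a constant penalty when the action was adapted.
    r_star = r - penalty if intervened else r

    if mode == "Naive":
        return [(s, a, s_next, r)]
    if mode == "AdaptionPenalty":
        return [(s, a, s_next, r_star)]
    if mode == "SafeAction":
        return [(s, a_phi, s_next, r)]
    if mode == "Both":
        return [(s, a, s_next, r_star), (s, a_phi, s_next, r)]
    raise ValueError(f"Unknown transition tuple mode: {mode}")
```

Note that `Both` stores two transitions per environment step, so the replay buffer can fill faster than with the other variants.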

By default, the benchmark trains `TRAIN_ITERS=1` models for each valid configuration. Every model is trained for `STEPS=60_000` steps for the pendulum and `STEPS=200_000` for the 2D Quadrotor. Afterwards, each model is deployed for `MODEL_ITERS=1` episodes.

One run results in:
- TensorBoard logs for training and deployment, stored in `tensorboard/`
- the trained model, stored in `models/`
- CSV files for training and deployment with the mean and standard deviation of the most important metrics (reward, safety violation, intervention rate), stored in `data/`

### Run specified in CodeOcean

To demonstrate reproducibility while respecting the resource constraints of CodeOcean, the run file reproduces one training and deployment run with TD3 for eight pendulum configurations and one quadrotor configuration. In particular:
- pendulum & quadrotor: TD3 baselines
- pendulum: TD3 action replacement with uniform sampling strategy and all 4 learning tuples
- pendulum: TD3 action replacement with failsafe controller and naive tuple
- pendulum: TD3 action projection with naive tuple
- pendulum: TD3 action masking (cont.) with naive tuple

### Run all experiments
**Note: for the results reported in the paper, the [Gurobi solver](https://support.gurobi.com/hc/en-us) was used.**

You can run `./tmux_pendulum.sh` or `./tmux_2D_quadrotor.sh`, which start an independent, persistent tmux session for each experiment configuration (**requires 65 cores / hardware threads**). Alternatively, you can run custom configurations by adjusting `benchmark.py` accordingly (see the `__main__` top-level code). For each episode, `benchmark.py` generates TensorBoard logs in `tensorboard/`. The trained models are saved in `models/`. Moreover, `.csv` files in `data/` store the (smoothed) mean and standard deviation of `env_reward`, `is_safety_violation`, and `safety_activity` (see 'Tensorboard Tags') during training (averaged over all `TRAIN_ITERS`). Set `TRAIN_ITERS=10` and `MODEL_ITERS=3` to reproduce the results reported in the paper. Note that `safety_activity` is called intervention rate in the paper.
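
If fewer than 65 cores are available, one possible alternative to the tmux scripts is to run the configurations through a bounded worker pool. The sketch below is not part of the repository; it only wraps the command shown under "Run an experiment". The helper names `build_command` and `run_all` and the `max_workers` default are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable, List

# Script path as used in the "Run an experiment" section.
SCRIPT = "provably_safe_benchmark/benchmark/benchmark_pendulum.py"


def build_command(config_id: int, script: str = SCRIPT) -> List[str]:
    """Build the command line for one configuration ID (0 to 64)."""
    if not 0 <= config_id <= 64:
        raise ValueError(f"config_id must be in [0, 64], got {config_id}")
    return [sys.executable, script, "--f", str(config_id)]


def run_all(config_ids: Iterable[int], max_workers: int = 4) -> Dict[int, int]:
    """Run the given configurations, at most `max_workers` at a time.

    Returns a mapping from configuration ID to process return code.
    """
    ids = list(config_ids)

    def run_one(config_id: int) -> int:
        # check=False: a failing configuration should not abort the others.
        return subprocess.run(build_command(config_id), check=False).returncode

    # Threads are sufficient here because each worker only waits on a subprocess.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(ids, pool.map(run_one, ids)))


# Example usage from the repository root (launches the actual training runs):
#     return_codes = run_all(range(65), max_workers=4)
```

Choose `max_workers` according to the number of available cores. Each training process may itself use several threads, so a value below the core count can be a reasonable starting point.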

## Structure
```
./
├── hyperparams
│   ├── hyperparams_2D_quadrotor.yml
│   └── hyperparams_pendulum.yml
├── matlab
├── provably_safe_benchmark
│   ├── benchmark
│   │   ├── benchmark_2D_quadrotor.py
│   │   └── benchmark_pendulum.py
│   ├── callbacks
│   │   ├── deploy_pendulum_callback.py
│   │   ├── deploy_quadrotor_callback.py
│   │   ├── train_pendulum_callback.py
│   │   └── train_quadrotor_callback.py
│   ├── external
│   │   ├── provably-safe-env
│   │   │   ├── provably_safe_env
│   │   │   │   ├── demos
│   │   │   │   │   └── demo_long_quadrotor_env.py
│   │   │   │   ├── envs
│   │   │   │   │   ├── __init__.py
│   │   │   │   │   ├── long_quadrotor_env.py
│   │   │   │   │   └── simple_pendulum_env.py
│   │   │   │   └── __init__.py
│   │   │   ├── README.md
│   │   │   └── setup.py
│   ├── stable-baselines3-contrib
│   │   └── sb3_contrib
│   │   ├── common
│   │   │   ├── __init__.py
│   │   │   ├── maskable
│   │   │   │   ├── buffers.py
│   │   │   │   ├── callbacks.py
│   │   │   │   ├── distributions.py
│   │   │   │   ├── evaluation.py
│   │   │   │   ├── __init__.py
│   │   │   │   ├── policies.py
│   │   │   │   └── utils.py
│   │   │   ├── safe_region.py
│   │   │   ├── utils.py
│   │   │   └── wrappers
│   │   │       ├── action_masking.py
│   │   │       ├── action_projection.py
│   │   │       ├── action_replacement.py
│   │   │       ├── informer.py
│   │   │       └── __init__.py
│   │   ├── dqn
│   │   │   ├── dqn.py
│   │   │   ├── __init__.py
│   │   │   └── policies.py
│   │   ├── __init__.py
│   │   ├── ppo
│   │   │   ├── __init__.py
│   │   │   ├── policies.py
│   │   │   └── ppo.py
│   │   ├── sac
│   │   │   ├── __init__.py
│   │   │   ├── policies.py
│   │   │   └── sac.py
│   │   ├── td3
│   │   │   ├── __init__.py
│   │   │   ├── policies.py
│   │   │   └── td3.py
│   │   └── tqc
│   ├── __init__.py
│   └── util
│       ├── __init__.py
│       ├── depoly_stats.py
│       ├── training_stats.py
│       ├── test_limit_function.py
│       ├── tictoc.py
│       └── util.py
├── environment.yml
├── experiment_ids.xlsx
├── README.md
├── setup.py
├── tmux_2D_quadrotor.sh
└── tmux_pendulum.sh
```
The learning algorithms are adapted from [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3).

## Additional functionalities

### Tensorboard Tags
```bash
tensorboard --logdir tensorboard
```
Important tags have the global prefix `benchmark_train/` or `benchmark_deploy/`. Supplementary tags have the prefix `benchmark_train_sup/` or `benchmark_deploy_sup/`. The tag prefix `avg_` denotes the total measurement divided by the episode length. Note that SB3 might log additional information as well. Multiple runs of the same configuration (`TRAIN_ITERS`) are shown in sequence. In case of a "too many open files" IO error, inspect individual subdirectories.
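
As a small illustration of the `avg_` convention, the sketch below divides the episode total of a per-step measurement by the episode length. The function name `episode_average` and the example values are hypothetical; the actual logging code in the callbacks may compute this differently.

```python
from typing import Sequence


def episode_average(step_values: Sequence[float]) -> float:
    """Total of a per-step measurement divided by the episode length."""
    if len(step_values) == 0:
        raise ValueError("episode must contain at least one step")
    return sum(step_values) / len(step_values)


# Hypothetical per-step intervention flags for a 4-step episode
# (1 = the safety method intervened in that step).
interventions = [0, 1, 1, 0]
avg_interventions = episode_average(interventions)  # 2 interventions / 4 steps
```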

