# General Info
This codebase contains implementations of offline and deployment-efficient RL using SAC, BCQ, and BRAC, forked from [Behavior Regularized Offline Reinforcement Learning](https://github.com/google-research/google-research/tree/master/behavior_regularized_offline_rl).
See the [original README](README_original.md) for setup instructions.

## Changing OpenAI Gym Environment
We modified the environment configuration file to allow a fair comparison between model-free and model-based offline RL methods, and added the `CheetahRun-v0` experiment.

Install the gym package included with this supplemental material instead of the original one.

# Reproducing Results
## Data-Collection with Online SAC
Run `train_online.py` with the option `--agent_name=sac`.

This is the same as in the [original BRAC implementation](README_original.md).


## Offline BCQ and BRAC
Run `train_offline.py` with the option `--agent_name=bcq` for BCQ or `--agent_name=brac_primal` for BRAC; see the [original BRAC implementation](README_original.md).


## Deployment-Efficient RL with SAC
Run `train_online.py` with the option `--agent_name=sac_recursive`. For example:
```shell script
ENV_NAME=Ant-v2
SUB_DIR=subdir
SEED=0
TOTAL_TRAIN_STEPS=50000
DATA_COLLECTION_FREQ=10000
DATA_COLLECTION_STEPS=100000
BUFFER_SIZE=500000
ROOT_DIR='/path/to/your_dir'
python -m train_online \
--root_dir=$ROOT_DIR \
--sub_dir=$SUB_DIR \
--env_name=$ENV_NAME \
--seed=$SEED \
--agent_name=sac_recursive \
--total_train_steps=$TOTAL_TRAIN_STEPS \
--data_collection_freq=$DATA_COLLECTION_FREQ \
--data_collection_steps=$DATA_COLLECTION_STEPS \
--eval_freq=1000 \
--gin_bindings="train_eval_online.replay_buffer_size=$BUFFER_SIZE" \
--gin_bindings="train_eval_online.model_params=(((300, 300), (200, 200),), 2)" \
--gin_bindings="train_eval_online.batch_size=256" \
--gin_bindings="train_eval_online.optimizers=(('adam', 0.0005),)"
```
Results are saved in `/path/to/your_dir/{env_name}/{agent_name}/{sub_dir}/{seed}`.

- `env_name`: one of `Ant-v2`, `HalfCheetah-v2`, `Hopper-v2`, `Walker2d-v2`, and `CheetahRun-v0`
- `root_dir`: root directory where you save the result and model
- `sub_dir`: name of subdirectory of the experiment
- `seed`: random seed
- `total_train_steps`: the number of training steps
- `data_collection_freq`: the number of training steps between deployments
  - therefore, the number of deployments is `int(total_train_steps/data_collection_freq) + 1`
- `data_collection_steps`: the number of transitions collected per deployment
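
The deployment count and the output location described above can be checked with a small sketch. The helper names `num_deployments` and `result_dir` are hypothetical and not part of this codebase; the total-transitions figure assumes that every deployment collects the full `data_collection_steps`.

```python
def num_deployments(total_train_steps: int, data_collection_freq: int) -> int:
    """Deployment count, using the formula from the option list above."""
    return int(total_train_steps / data_collection_freq) + 1


def result_dir(root_dir: str, env_name: str, agent_name: str,
               sub_dir: str, seed: int) -> str:
    """Output directory, following the layout stated in this README."""
    return "/".join([root_dir, env_name, agent_name, sub_dir, str(seed)])


if __name__ == "__main__":
    # Values taken from the SAC example command above.
    n = num_deployments(total_train_steps=50000, data_collection_freq=10000)
    print("deployments:", n)
    # Assumption: each deployment collects exactly data_collection_steps transitions.
    print("total transitions:", n * 100000)
    print(result_dir("/path/to/your_dir", "Ant-v2", "sac_recursive", "subdir", 0))
```

With the example settings, this gives 6 deployments.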


## Deployment-Efficient RL with BCQ
Run `train_online.py` with the option `--agent_name=bcq_recursive`. For example:
```shell script
PHI=0.15
PLR=3e-05
ENV_NAME=Ant-v2
SUB_DIR=subdir
SEED=0
EVAL_TARGET=9000
TOTAL_TRAIN_STEPS=50000
DATA_COLLECTION_FREQ=10000
DATA_COLLECTION_STEPS=100000
BUFFER_SIZE=500000
ROOT_DIR='/path/to/your_dir'
python -m train_online \
--root_dir=$ROOT_DIR \
--sub_dir=$SUB_DIR \
--env_name=$ENV_NAME \
--eval_target=$EVAL_TARGET \
--seed=$SEED \
--agent_name=bcq_recursive \
--total_train_steps=$TOTAL_TRAIN_STEPS \
--data_collection_freq=$DATA_COLLECTION_FREQ \
--data_collection_steps=$DATA_COLLECTION_STEPS \
--eval_freq=1000 \
--gin_bindings="train_eval_online.replay_buffer_size=$BUFFER_SIZE" \
--gin_bindings="train_eval_online.model_params=(((300, 300), (300, 300), (750, 750)), 2, $PHI)" \
--gin_bindings="train_eval_online.batch_size=256" \
--gin_bindings="train_eval_online.optimizers=(('adam', 1e-3), ('adam', $PLR), ('adam', 1e-3))"
```
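
To repeat the run above over several seeds, the command can be assembled programmatically. The following is a minimal sketch, not part of this codebase: `build_command` is a hypothetical helper, and the flags and values are copied from the example above. It only prints the commands.

```python
import shlex


def build_command(seed, phi=0.15, plr=3e-05, env_name="Ant-v2",
                  sub_dir="subdir", root_dir="/path/to/your_dir"):
    """Assemble the argument list for one bcq_recursive run (hypothetical helper)."""
    prefix = "--gin_bindings=train_eval_online."
    return [
        "python", "-m", "train_online",
        f"--root_dir={root_dir}",
        f"--sub_dir={sub_dir}",
        f"--env_name={env_name}",
        "--eval_target=9000",
        f"--seed={seed}",
        "--agent_name=bcq_recursive",
        "--total_train_steps=50000",
        "--data_collection_freq=10000",
        "--data_collection_steps=100000",
        "--eval_freq=1000",
        prefix + "replay_buffer_size=500000",
        prefix + f"model_params=(((300, 300), (300, 300), (750, 750)), 2, {phi})",
        prefix + "batch_size=256",
        prefix + f"optimizers=(('adam', 1e-3), ('adam', {plr}), ('adam', 1e-3))",
    ]


# Print one shell-quoted command per seed; the seed range is an arbitrary choice.
for seed in range(3):
    print(shlex.join(build_command(seed)))
```

To launch the runs instead of printing them, each argument list could be passed to `subprocess.run(cmd, check=True)` from the directory that contains `train_online.py`.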

## Deployment-Efficient RL with BRAC
Run `train_recursive_brac.py` with the option `--agent_name=brac_recursive`. For example:
```shell script
ENV_NAME=Ant-v2
SUB_DIR=subdir
SEED=0
ALPHA=0.3
PLR=3e-5
VALUE_PENALTY=True
TOTAL_TRAIN_STEPS=50000
DATA_COLLECTION_FREQ=10000
DATA_COLLECTION_STEPS=100000
BUFFER_SIZE=500000
ROOT_DIR='/path/to/your_dir'
python -m train_recursive_brac \
--root_dir=$ROOT_DIR \
--sub_dir=$SUB_DIR \
--env_name=$ENV_NAME \
--seed=$SEED \
--total_train_steps=$TOTAL_TRAIN_STEPS \
--data_collection_freq=$DATA_COLLECTION_FREQ \
--data_collection_steps=$DATA_COLLECTION_STEPS \
--eval_freq=1000 \
--gin_bindings="brac_primal_agent.Agent.alpha=$ALPHA" \
--gin_bindings="brac_primal_agent.Agent.value_penalty=$VALUE_PENALTY" \
--gin_bindings="train_eval_recursive_brac.replay_buffer_size=$BUFFER_SIZE" \
--gin_bindings="train_eval_recursive_brac.model_params=(((300, 300), (200, 200),), 2)" \
--gin_bindings="train_eval_recursive_brac.batch_size=256" \
--gin_bindings="train_eval_recursive_brac.optimizers=(('adam', 1e-3), ('adam', $PLR), ('adam', 1e-3))" \
--gin_bindings="train_eval_recursive_brac.bc_train_steps=2000" \
--gin_bindings="train_eval_recursive_brac.bc_model_params=((200, 200),)" \
--gin_bindings="train_eval_recursive_brac.bc_batch_size=256" \
--gin_bindings="train_eval_recursive_brac.bc_optimizers=(('adam', 5e-4),)" 
```

