# Leveraging Behavioral Cloning for Representation Alignment in Cross-Domain Policy Transfer

This is the implementation for the paper "Leveraging Behavioral Cloning for Representation Alignment in Cross-Domain Policy Transfer".

### Videos
Videos of PLP agents and baseline agents are available in `videos/`.

### Installation
We used Python 3.8 and `torch==2.0.1` for our experiments.
First, install PyTorch and mujoco210 in your environment.
We recommend creating a virtual environment before installing packages.

```bash
$ pip install -U pip
$ pip install -e .
$ pip install -r requirements.txt
```
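
After installation, a quick check along the following lines can flag a mismatch with the versions we used. This is a minimal sketch: the helper name `check_environment` is hypothetical and not part of this repository, and it only reports differences rather than enforcing them. It checks for the `torch` module only; add other module names as needed for your setup.

```python
import importlib.util
import sys


def check_environment(expected_python=(3, 8), required_modules=("torch",)):
    """Return a list of warning strings; an empty list means nothing was flagged."""
    warnings = []
    if tuple(sys.version_info[:2]) != tuple(expected_python):
        warnings.append(
            "Python %d.%d detected; experiments were run with %d.%d"
            % (sys.version_info[0], sys.version_info[1],
               expected_python[0], expected_python[1])
        )
    for name in required_modules:
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(name) is None:
            warnings.append("module %r is not installed" % name)
    return warnings


for message in check_environment():
    print("warning:", message)
```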

### Download datasets
Download datasets from Google Drive [here](https://drive.google.com/drive/folders/1Cp74lIlIZc0Hx-kHLi5dIEL3wSyDmzD3?usp=sharing) and place them in `./data`.
When placing the datasets, keep the original directory structure.
- For P2P experiments, you need the hdf5 files in the root directory of the drive.
- For P2A experiments, you need the files under the `ant/` directory in addition to the P2P requirements.
- For R2R, you need the `robot` directory.
- For V2V-Reach, you need `metaworld/reach-color_simple_3-v2.hdf5`.
- For V2V-Open, you need `metaworld/window-close_4-v2.hdf5` and `metaworld/window-open-v2.hdf5`.
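
The requirements above can be checked before launching a run. The following is a minimal sketch, not part of this repository: the `REQUIRED` table and the `missing_data` helper are hypothetical names. Because the P2P file names are not listed in this README, the sketch only checks that at least one `*.hdf5` file exists in the data root.

```python
from pathlib import Path

# Paths relative to ./data, taken from the list above.
REQUIRED = {
    "P2P": [],
    "P2A": ["ant"],
    "R2R": ["robot"],
    "V2V-Reach": ["metaworld/reach-color_simple_3-v2.hdf5"],
    "V2V-Open": [
        "metaworld/window-close_4-v2.hdf5",
        "metaworld/window-open-v2.hdf5",
    ],
}


def missing_data(experiment, data_dir="./data"):
    """Return the required paths for `experiment` that are absent from `data_dir`."""
    root = Path(data_dir)
    missing = [rel for rel in REQUIRED[experiment] if not (root / rel).exists()]
    # P2P and P2A both need the hdf5 files from the root of the drive folder.
    if experiment in ("P2P", "P2A") and not any(root.glob("*.hdf5")):
        missing.append("*.hdf5")
    return missing


for name in REQUIRED:
    print(name, "missing:", missing_data(name))
```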
 
### Instructions for running Maze experiments

#### Run Scripts
To run experiments in P2P-medium with PLP, use the following command:
```bash
$ python common/ours/main.py  # for PLP and BC
```

To run baselines, use the following commands:
```bash
$ python common/dail/main.py  # for GAMA
$ python common/cdil/main.py  # for CDIL
$ python common/cond/main.py  # for Contextual
```

Results are displayed during the run and partly saved in `results/`.

#### Options
For each command, you can add the following options:
- `device=` for specifying the GPU. `cuda:0` is used by default.
- `name=` for specifying the run name.
- `config=` for specifying the config file. A config for P2P-medium is used by default.
    - Config files for PLP are provided in `common/ours/config`.
        - `p2p.yaml` is used by default.
        - For P2P-umaze, use `p2p_umaze.yaml`.
        - For P2A-medium, use `p2a.yaml`.
        - For P2A-umaze, use `p2a_umaze.yaml`.
        - For P2P-obs-medium, use `p2p_obs.yaml`.
    - For the baselines, please use files in `common/(method_name)/config/`.
    - For the BC baseline, use the PLP config file with `naive_bc=true`.
- `goal=` for specifying the goal ID.
    - We used goals 0, 3, and 6 for the umaze and goals 6, 17, and 20 for the medium maze.
      These goals were chosen to be spread across different areas of the maze as much as possible.
- `batch_size=` to increase or decrease the batch size.
- `complex_task=true` for an out-of-distribution (OOD) target task.
- `tcc=false` if you want to disable TCC.
- `mmd_coef=0` if you want to disable the MMD regularization.
- `adversarial_coef=0.5` if you want to add the discriminative loss with coefficient 0.5.
- `comet=true` if you want to enable logging with Comet. Set environment variable `COMET_PLP_PROJECT_NAME` appropriately.

For example, to run PLP in P2A-medium with goal 20, without the TCC loss, on GPU 1, and with the run name `test`, use the following command:
```bash
$ python common/ours/main.py config=common/ours/config/p2a.yaml goal=20 tcc=false device=cuda:1 name=test
```
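
To repeat such a run over several goals, the `key=value` options can be assembled programmatically. The following is a minimal sketch under stated assumptions: `build_command` is a hypothetical helper, not part of this repository, and the run names `test_goal6` etc. are illustrative. It only prints the commands; pass each `argv` list to `subprocess.run(argv, check=True)` to launch it.

```python
import shlex


def build_command(script="common/ours/main.py", **overrides):
    """Build an argv list of the form: python <script> key=value ..."""
    argv = ["python", script]
    for key, value in overrides.items():
        # The options above write booleans as lowercase true/false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        argv.append("%s=%s" % (key, value))
    return argv


# One command per medium-maze goal listed above (6, 17, 20).
commands = [
    build_command(
        config="common/ours/config/p2a.yaml",
        goal=goal,
        tcc=False,
        device="cuda:1",
        name="test_goal%d" % goal,
    )
    for goal in (6, 17, 20)
]

for argv in commands:
    print(shlex.join(argv))
```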
---

### Instructions for R2R Experiments
You can run R2R experiments in the same way as the Maze experiments, using the YAML file `common/(algorithm_name)/config/r2r.yaml`. See the description above for details.

### Instructions for V2V Experiments
You can use `common/(algorithm_name)/config/v2v_color.yaml` for V2V-Reach and `common/(algorithm_name)/config/v2v_open.yaml` for V2V-Open. See the description above for details.

---
The description below is from the original README of D4RL.

# D4RL: Datasets for Deep Data-Driven Reinforcement Learning
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

[![License](https://licensebuttons.net/l/by/3.0/88x31.png)](https://creativecommons.org/licenses/by/4.0/)

D4RL is an open-source benchmark for offline reinforcement learning. It provides standardized environments and datasets for training and benchmarking algorithms. A supplementary [whitepaper](https://arxiv.org/abs/2004.07219) and [website](https://sites.google.com/view/d4rl/home) are also available.

## Setup

D4RL can be installed by cloning the repository as follows:
```
git clone https://github.com/rail-berkeley/d4rl.git
cd d4rl
pip install -e .
```

Or, alternatively:
```
pip install git+https://github.com/rail-berkeley/d4rl@master#egg=d4rl
```

The control environments require MuJoCo as a dependency. You may need to obtain a [license](https://www.roboti.us/license.html) and follow the setup instructions for mujoco_py. This mostly involves copying the key to your MuJoCo installation folder.

The Flow and CARLA tasks also require additional installation steps:
- Instructions for installing CARLA can be found [here](https://github.com/rail-berkeley/d4rl/wiki/CARLA-Setup)
- Instructions for installing Flow can be found [here](https://flow.readthedocs.io/en/latest/flow_setup.html). Make sure to install using the SUMO simulator, and add the flow repository to your PYTHONPATH once finished.

## Using d4rl

d4rl uses the [OpenAI Gym](https://github.com/openai/gym) API. Tasks are created via the `gym.make` function. A full list of all tasks is [available here](https://github.com/rail-berkeley/d4rl/wiki/Tasks).

Each task is associated with a fixed offline dataset, which can be obtained with the `env.get_dataset()` method. This method returns a dictionary with:
- `observations`: An N by observation dimensional array of observations.
- `actions`: An N by action dimensional array of actions.
- `rewards`: An N dimensional array of rewards.
- `terminals`: An N dimensional array of episode termination flags. This is true when episodes end due to termination conditions such as falling over. 
- `timeouts`: An N dimensional array of termination flags. This is true when episodes end due to reaching the maximum episode length.
- `infos`: Contains optional task-specific debugging information.
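
Because the arrays are flat, episode boundaries have to be recovered from the `terminals` and `timeouts` flags. The following is a minimal sketch of one way to do this; `episode_bounds` is a hypothetical helper, not a d4rl function. It accepts any sequence of booleans, so `dataset['terminals']` and `dataset['timeouts']` can be passed directly.

```python
def episode_bounds(terminals, timeouts):
    """Return (start, end) index pairs, end exclusive, one per episode."""
    bounds = []
    start = 0
    for i, (done, timed_out) in enumerate(zip(terminals, timeouts)):
        # An episode ends on either a termination condition or a timeout.
        if done or timed_out:
            bounds.append((start, i + 1))
            start = i + 1
    # Keep a trailing partial episode that never raised either flag.
    if start < len(terminals):
        bounds.append((start, len(terminals)))
    return bounds


# Toy flags for N = 6 transitions: a termination at index 2, a timeout at index 4.
terminals = [False, False, True, False, False, False]
timeouts = [False, False, False, False, True, False]
print(episode_bounds(terminals, timeouts))  # [(0, 3), (3, 5), (5, 6)]
```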

You can also load data using `d4rl.qlearning_dataset(env)`, which formats the data for use by typical Q-learning algorithms by adding a `next_observations` key.

```python
import gym
import d4rl # Import required to register environments

# Create the environment
env = gym.make('maze2d-umaze-v1')

# d4rl abides by the OpenAI gym interface
env.reset()
env.step(env.action_space.sample())

# Each task is associated with a dataset
# dataset contains observations, actions, rewards, terminals, and infos
dataset = env.get_dataset()
print(dataset['observations']) # An N x dim_observation Numpy array of observations

# Alternatively, use d4rl.qlearning_dataset which
# also adds next_observations.
dataset = d4rl.qlearning_dataset(env)
```

Datasets are automatically downloaded to the `~/.d4rl/datasets` directory when `get_dataset()` is called. If you would like to change the location of this directory, you can set the `$D4RL_DATASET_DIR` environment variable to the directory of your choosing, or pass in the dataset filepath directly into the `get_dataset` method.
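
The directory lookup described above can be sketched as follows. This mirrors the documented behavior only; `resolve_dataset_dir` is a hypothetical helper, not a d4rl function. If you set `D4RL_DATASET_DIR` from Python, doing so before `import d4rl` is the safe ordering, since the library may read the variable at import time.

```python
import os


def resolve_dataset_dir(environ=os.environ):
    """Return $D4RL_DATASET_DIR if set, else the default ~/.d4rl/datasets."""
    default = os.path.join("~", ".d4rl", "datasets")
    return os.path.expanduser(environ.get("D4RL_DATASET_DIR", default))


print(resolve_dataset_dir())
```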

### Normalizing Scores
You can use the `env.get_normalized_score(returns)` function to compute a normalized score for an episode, where `returns` is the undiscounted total sum of rewards accumulated during an episode.

The individual min and max reference scores are stored in `d4rl/infos.py` for reference.
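
The normalization is a linear rescaling between the two reference scores, so that the min reference maps to 0 and the max reference maps to 1. The following is a minimal sketch of that computation, assuming min-max scaling; the function name and the reference values are hypothetical and chosen for illustration only.

```python
def normalized_score(returns, ref_min_score, ref_max_score):
    """Rescale linearly so ref_min_score maps to 0.0 and ref_max_score to 1.0."""
    return (returns - ref_min_score) / (ref_max_score - ref_min_score)


# Hypothetical reference scores, for illustration only.
print(normalized_score(50.0, ref_min_score=0.0, ref_max_score=200.0))  # 0.25
```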

## Algorithm Implementations

We have aggregated implementations of various offline RL algorithms in a [separate repository](https://github.com/rail-berkeley/d4rl_evaluations). 

## Off-Policy Evaluations

D4RL currently has limited support for off-policy evaluation methods, on a select few locomotion tasks. We provide trained reference policies and a set of performance metrics. Additional details can be found in the [wiki](https://github.com/rail-berkeley/d4rl/wiki/Off-Policy-Evaluation).

## Recent Updates

### 2-12-2020
- Added new Gym-MuJoCo datasets (labeled v2) which fixed Hopper's performance and the qpos/qvel fields.
- Added additional wiki documentation on [generating datasets](https://github.com/rail-berkeley/d4rl/wiki/Dataset-Reproducibility-Guide).


## Acknowledgements

D4RL builds on top of several excellent domains and environments built by various researchers. We would like to thank the authors of:
- [hand_dapg](https://github.com/aravindr93/hand_dapg) 
- [gym-minigrid](https://github.com/maximecb/gym-minigrid)
- [carla](https://github.com/carla-simulator/carla)
- [flow](https://github.com/flow-project/flow)
- [adept_envs](https://github.com/google-research/relay-policy-learning)

## Citation

Please use the following bibtex for citations:

```
@misc{fu2020d4rl,
    title={D4RL: Datasets for Deep Data-Driven Reinforcement Learning},
    author={Justin Fu and Aviral Kumar and Ofir Nachum and George Tucker and Sergey Levine},
    year={2020},
    eprint={2004.07219},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

## Licenses

Unless otherwise noted, all datasets are licensed under the [Creative Commons Attribution 4.0 License (CC BY)](https://creativecommons.org/licenses/by/4.0/), and code is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).


