# DPO code

PyTorch implementation of DPO.

**Important Notes**

This repository is based on [rlswiss](https://github.com/KamyarGh/rl_swiss), which is an extension of the August 2018 version of [rlkit](https://github.com/vitchyr/rlkit). Since then, the design approaches of rlswiss and rlkit have diverged considerably, and it is for this reason that we are releasing rlswiss as a separate repository. *If you find this repository useful for your research/projects, please cite this repository as well as [rlkit](https://github.com/vitchyr/rlkit).*

# Algorithms

Implemented RL algorithms:
- Soft-Actor-Critic (SAC)

Implemented LfD algorithms:
- Adversarial methods for Inverse Reinforcement Learning
    - AIRL / GAIL / FAIRL / Discriminator-Actor-Critic
- Behaviour Cloning
- DAgger

Implemented LfO algorithms:

- BCO
- GAIfO
- DPO

# How to run

Notes:
- First, modify `rlkit/launchers/config.py` to suit your setup.
- `run_experiment.py` calls `srun`, which is a SLURM command. Use the `--nosrun` flag to run on your local machine instead of through SLURM.
- The expert demonstrations and state marginal data used for imitation learning experiments can be found at [THIS LINK](https://drive.google.com/drive/folders/1jwKb5FjFtAlvBUDdHiHJN0i7PsBCthfg?usp=sharing). To use them please download them and modify the paths in expert_demos_listing.yaml.
- The yaml files describe the experiments to run and have three sections:
    - `meta_data`: general experiment and resource settings
    - `variables`: the hyperparameters to search over
    - `constants`: hyperparameters that will not be searched over
- The conda env specs are in rl_swiss_conda_env.yaml. You can refer to [THIS LINK](https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-from-file) for notes on how to set up your conda environment using the rl_swiss_conda_env.yaml file.
- You need to have MuJoCo and mujoco-py installed.
- Because of a minor dependency on rllab, you also need to install rllab: `run_experiment.py` calls `build_nested_variant_generator`, which uses something from rllab. I will try to remove this dependency in future versions.
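
As a quick sanity check before launching anything, the prerequisites from the notes above can be probed with a small preflight script. This is only a sketch: `have_cmd`, `have_pymod`, and `report` are hypothetical helpers that are not part of this repository, and the script assumes it is run from the repository root with the intended conda environment already activated. It only reports what it finds and does not install or modify anything.

```bash
#!/usr/bin/env bash
# Preflight sketch: report which prerequisites from the notes are visible.
# have_cmd, have_pymod and report are hypothetical helpers (not in the repo).

have_cmd() {
  # Succeeds if the named executable is on PATH.
  command -v "$1" >/dev/null 2>&1
}

have_pymod() {
  # Succeeds only if a python executable exists and can import the module.
  # Note: the first import of mujoco_py may take a while.
  have_cmd python && python -c "import $1" >/dev/null 2>&1
}

report() {
  # $1 = label, remaining arguments = the check to run.
  local label="$1"; shift
  if "$@"; then
    echo "ok       ${label}"
  else
    echo "MISSING  ${label}"
  fi
}

report "conda" have_cmd conda
report "srun (not needed with --nosrun)" have_cmd srun
report "mujoco_py (Python module)" have_pymod mujoco_py
report "rllab (Python module)" have_pymod rllab
report "rlkit/launchers/config.py" test -f rlkit/launchers/config.py
```

A `MISSING` line for `srun` is harmless if you pass `--nosrun`; the other entries correspond to requirements listed in the notes.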

## Reproducing Imitation Learning Results
### Training Expert Policies
Train an SAC agent and collect expert demos using [this repo](https://github.com/Ericonaldo/Softlearning).

### Training LfO Agents

BCO

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/bco_hopper_4.yaml
```
GAIfO

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/gailfo_hopper_4.yaml
```

GAIfO-DP

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/gailfo_dp_hopper_4.yaml
```

DPO without PG

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/sl_lfo_hopper_4.yaml
```

DPO

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/dpo_union_hopper_4.yaml
```

DPO with cycle training (with or without multi-step)

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/dpo_union_ms_cycle_hopper_4.yaml
```

DPO with multi-step regularization

```bash
python run_experiment.py --nosrun -e exp_specs/gail_lfo_exps/dpo_union_ms_hopper_4.yaml
```
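
To run all of the Hopper LfO experiments above one after another, a small wrapper loop may be convenient. The following is a minimal sketch, not part of the repository: `DRY_RUN`, `SPEC_DIR`, `SPECS`, and `run_all` are hypothetical names introduced here, and the spec filenames are simply copied from the commands in this README. By default it only prints the commands; set `DRY_RUN=0` to actually execute them from the repository root.

```bash
#!/usr/bin/env bash
# Sketch: run the Hopper LfO experiments from this README in sequence.
# DRY_RUN, SPEC_DIR, SPECS and run_all are hypothetical names, not repo features.
set -euo pipefail

# 1 (default) prints each command; 0 executes it.
DRY_RUN="${DRY_RUN:-1}"
SPEC_DIR="exp_specs/gail_lfo_exps"

# Spec files copied from the commands listed in this README.
SPECS=(
  bco_hopper_4.yaml
  gailfo_hopper_4.yaml
  gailfo_dp_hopper_4.yaml
  sl_lfo_hopper_4.yaml
  dpo_union_hopper_4.yaml
  dpo_union_ms_cycle_hopper_4.yaml
  dpo_union_ms_hopper_4.yaml
)

run_all() {
  local spec
  for spec in "${SPECS[@]}"; do
    local cmd=(python run_experiment.py --nosrun -e "${SPEC_DIR}/${spec}")
    if [ "${DRY_RUN}" = "1" ]; then
      echo "${cmd[*]}"
    else
      "${cmd[@]}"
    fi
  done
}

run_all
```

Because of `set -e`, a real run stops at the first experiment that exits with an error; drop that option if you would rather continue with the remaining specs.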