# UDG Code and Models
## Introduction
This repository contains example code for the UDG paper, including some of the network models and data. UDG has two training stages:
1. Unsupervised policy training and data generation. In directory `url-data`.
2. Offline policy training. In directory `mopo-local`.
The following sections describe how to configure and install `url-data` and `mopo-local`, followed by examples for reproducing the results.
## Configuration
Ensure Anaconda is up to date, then create a conda environment with:
```
conda create -n udg python=3.8
```
Install the MuJoCo license (version 210) and mujoco-py.
Go into the `url-data` directory and run:
```
pip install -e . -r requirements.txt
```
Go into `mopo-local` and manually install the dependencies
```
pip install -r environment/requirements.txt
```
Note: installing matplotlib may require the freetype and png libraries to be installed first.
Go into `mopo-local/d4rl/dm_control` and install dm_control with
```
pip install -e .
```
Go into `mopo-local/d4rl/mjrl` and install mjrl.
Go into `mopo-local/d4rl` and install d4rl.
Go into `mopo-local/serializable` and install serializable.
Go into `mopo-local/url_offline` and install url_offline.
Go into `mopo-local` and install mopo-local.
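
The install order above can be captured in a small helper. This is only a sketch: the source gives an explicit command (`pip install -e .`) for dm_control alone, so using the same editable install for every other package is an assumption, and `install_commands` and `install_all` are hypothetical names that do not exist in this repo. By default the helper only prints the commands.

```python
import subprocess
from pathlib import Path

# Install order taken from the steps above; paths are relative to the repo root.
INSTALL_ORDER = [
    "mopo-local/d4rl/dm_control",
    "mopo-local/d4rl/mjrl",
    "mopo-local/d4rl",
    "mopo-local/serializable",
    "mopo-local/url_offline",
    "mopo-local",
]


def install_commands(repo_root="."):
    """Return (working_dir, command) pairs, one per package, in install order."""
    root = Path(repo_root)
    # Assumption: every package accepts an editable install, as dm_control does.
    return [(root / d, ["pip", "install", "-e", "."]) for d in INSTALL_ORDER]


def install_all(repo_root=".", dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for cwd, cmd in install_commands(repo_root):
        print(f"(cd {cwd} && {' '.join(cmd)})")
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    install_all(dry_run=True)
```

Run it from the repository root inside the `udg` conda environment, and switch to `dry_run=False` only after checking the printed commands.
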
Set an environment variable pointing to the root directory for models and data. Models, logs, and data generated by this code are stored there.
```
export UDG_DATA_PATH=/path/to/data
```
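
Before launching a long training run, it can help to confirm that the variable is set and the directory is usable. The check below is a minimal sketch: `resolve_data_root` is a hypothetical helper, not part of this repo, and how the training scripts themselves read `UDG_DATA_PATH` is not shown here.

```python
import os
from pathlib import Path


def resolve_data_root(env=None):
    """Return the directory named by UDG_DATA_PATH, creating it if needed."""
    env = os.environ if env is None else env
    value = env.get("UDG_DATA_PATH")
    if not value:
        raise RuntimeError(
            "UDG_DATA_PATH is not set; export it before running the scripts"
        )
    root = Path(value).expanduser()
    # Create the directory up front so later writes do not fail halfway through.
    root.mkdir(parents=True, exist_ok=True)
    return root
```

Calling `resolve_data_root()` with no arguments reads the real environment; passing a dictionary makes the check easy to try without touching the shell.
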
## Examples
Train 10 policies with WURL on the Ant-Angle task (`rc` is the coefficient for the task reward, `src` is the coefficient for the diversity reward):
```
python url-data/train_wasserstein_ro.py --num_modes 10 --env-name AntCustom-v2 --rc 0.0 --src 1.0
```
Train 5 policies with WURL on the Cheetah-Jump task:
```
python url-data/train_wasserstein_ro.py --num_modes 5 --env-name CheetahJump-v2 --rc 0.0 --src 1.0
```
Generate 1M transitions with one policy model (set the model directory, time, step, and mode (for example, 0-9) at the corresponding locations in `generator_v1.py`):
```
python url-data/generator_v1.py --env-name AntCustom-v2 --target_size 1000000
```
After generating data for all policies, evaluate the datasets:
```
python url-data/reeval-ant.py
```
This step generates a `reward_matrix.np` file that records the average return of every data buffer on each task. MOPO uses this matrix to select the best data buffer.
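
To illustrate the selection step, the sketch below picks the buffer with the highest average return for each task. It assumes rows index data buffers and columns index tasks; the actual layout and on-disk format of `reward_matrix.np` are not documented here, so the matrix is a plain nested list with made-up values, and `best_buffer_per_task` is a hypothetical helper.

```python
def best_buffer_per_task(reward_matrix):
    """For each task (column), return the index of the buffer (row)
    with the highest average return."""
    num_tasks = len(reward_matrix[0])
    best = []
    for task in range(num_tasks):
        column = [row[task] for row in reward_matrix]
        # max() keeps the first index on ties.
        best.append(max(range(len(column)), key=column.__getitem__))
    return best


# Made-up returns for 3 buffers (rows) on 2 tasks (columns), illustration only.
example = [
    [120.0, -35.0],
    [450.5, 10.0],
    [90.0, 300.0],
]
print(best_buffer_per_task(example))  # [1, 2]
```
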
Train offline algorithms on Ant-Angle tasks:
```
python mopo-local/transfer_ant.py
```
Specify the path to the data buffers (the root directory of all policies) and the hyperparameters `rollout_length` and `penalty_coeff` in `transfer_ant.py`.
This step first writes the environment config to `mopo-local/url_offline/gym_mujoco/__init__.py`, then creates a config file in `mopo-local/examples/config/url/`, and finally starts transition model learning and offline training.
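
For intuition about the two hyperparameters, the sketch below shows the role they play in MOPO as generally described: model rollouts are truncated after `rollout_length` steps, and each predicted reward is reduced by `penalty_coeff` times an uncertainty estimate. How `transfer_ant.py` applies them internally is not shown in this README, so the function names and the list-based interface are assumptions for illustration only.

```python
def penalized_reward(model_reward, uncertainty, penalty_coeff):
    """MOPO-style reward: predicted reward minus a scaled uncertainty estimate."""
    return model_reward - penalty_coeff * uncertainty


def penalize_rollout(rewards, uncertainties, penalty_coeff, rollout_length):
    """Return penalized rewards for the first rollout_length steps of a model rollout."""
    steps = zip(rewards[:rollout_length], uncertainties[:rollout_length])
    return [penalized_reward(r, u, penalty_coeff) for r, u in steps]
```

A larger `penalty_coeff` makes the offline policy more conservative in regions where the learned model is uncertain, and a shorter `rollout_length` limits how far model errors can compound.
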
## Notes
A 1M-transition dataset takes about 150 MB of disk space, so we cannot provide sample training datasets in this repo. Instead, we provide the actor-critic models used for data generation and the corresponding `reward_matrix.np`.