# Successor Features implementation in PyTorch

## Requirements
We assume you have access to a GPU that can run CUDA 11.2.
Make sure you clone with the `--recurse-submodules` flag so the submodule dependencies are fetched as well:
```
git clone --recurse-submodules <git-url>
```

Then, the simplest way to install all required dependencies is to create an Anaconda environment and activate it:
```
conda env create -f conda_env.yml
source activate sac_sf
```

To install the third-party dependencies:
```
bash setup_thirdparty.sh
```
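
Once the environment is active, a quick sanity check along these lines can confirm that PyTorch sees the GPU. This is only a sketch: the helper name `describe_cuda_setup` is made up for illustration and is not part of this repository.

```python
import importlib.util


def describe_cuda_setup():
    """Report whether PyTorch is installed and can see a CUDA device.

    Hypothetical helper for illustration; not part of this repository.
    """
    report = {"torch_installed": False, "cuda_available": False, "cuda_version": None}
    if importlib.util.find_spec("torch") is None:
        # PyTorch is missing: the conda environment is probably not active.
        return report
    import torch

    report["torch_installed"] = True
    report["cuda_available"] = bool(torch.cuda.is_available())
    # CUDA version PyTorch was built against (None for CPU-only builds).
    report["cuda_version"] = torch.version.cuda
    return report


if __name__ == "__main__":
    print(describe_cuda_setup())
```

If `cuda_available` comes back `False`, check the driver and that the `sac_sf` environment is the one that is active.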

## Supported Environments
- [dm_control](https://gitlab.com/melfm/dm_control)
- [Metaworld](https://gitlab.com/melfm/metaworld)

Both of these environments were modified. For `dm_control`, use the `master` branch; for `metaworld`, use the `acsf` branch.

## Instructions
To train a SAC agent on the `reacher easy` task run:
```
python train_expert.py env=reacher_easy experiment=myExperiment
```

The script `train_expert.py` is just vanilla SAC training from scratch.

This will produce a `runs` folder, where all the outputs are stored, including train/eval logs, tensorboard blobs, and evaluation episode videos. You can attach tensorboard to monitor training by running:
```
tensorboard --logdir runs
```

There are 3 steps of training:

1. Train one expert on all tasks (this is needed for the feature learning part).
2. Train the successor feature representation.
3. Train the successor-feature-based SAC policy.
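
The three steps above can be sketched as an ordered list of commands. This is a minimal sketch, not a script shipped with the repository: the function `build_pipeline` and the experiment labels `Expert`, `LearnPhiW` and `SF` are placeholders chosen for illustration, while the script names and the `expert_model_date`, `phi_w_model_date` and `representation.latent_size` arguments are the ones used in this README.

```python
import shlex


def build_pipeline(env, expert_date, phi_w_date, seed=1, latent_size=14):
    """Return the three training commands in the order they must run.

    Hypothetical helper; the experiment labels are placeholders.
    """
    common = [f"env={env}", f"seed={seed}"]
    latent = f"representation.latent_size={latent_size}"
    return [
        # Step 1: expert SAC policy on all tasks.
        ["python", "train_expert.py", *common, "experiment=Expert"],
        # Step 2: learn phi and w, loading the expert trained on expert_date.
        ["python", "train_phi_w.py", *common, "experiment=LearnPhiW",
         latent, f"expert_model_date={expert_date}"],
        # Step 3: SF-based SAC policy, loading the phi/w models from phi_w_date.
        ["python", "train.py", *common, "experiment=SF",
         latent, f"phi_w_model_date={phi_w_date}"],
    ]


if __name__ == "__main__":
    for cmd in build_pipeline("metaworld", "2021.11.13", "2021.11.13"):
        print(shlex.join(cmd))
```

Each step depends on the outputs of the previous one, so the commands are meant to be run one after another (for example with `subprocess.run(cmd, check=True)`), after the hard-coded model directories described below have been updated.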

### Expert Training
First we train an expert policy on all the tasks. This is just training a SAC policy. Example commands are:

### DMControl Reacher
```
python train_expert.py env=reacher_easy experiment=ExpertReacher goal_mode=multi_goal seed=1
```

### MetaWorld
By default this trains the tasks `reacher`, `door_close` and `reach-wall`. You can change this inside `metaworld/envs/mujoco/env_dict.py` by adding or removing environments from `ML2_V2`.

```
python train_expert.py env=metaworld experiment=CustomMultiTaskBenchmark
```

### Regressing W & Phi
To learn the successor features:

```
python train_phi_w.py env=metaworld experiment=LearnPhiWMetaworld representation.latent_size=14 expert_model_date=2021.11.13 seed=1
```
The above command trains the representations on the Metaworld benchmark with a latent size of `14`; the `phi` hidden dimension is `512`. It also requires `expert_model_date`, the date of the expert model to be loaded. Currently `self.expert_dir` is hard-coded inside the [code](https://gitlab-master.nvidia.com/mmozifian/pytorch_sac_sf/-/blob/master/train_phi_w.py#L165) (sorry!). Change it to match the complete name of the experiment directory, including the seed, like `metaworld_CustomMultiTaskBenchmark_env=metaworld,experiment=CustomMultiTaskBenchmark/seed=1/`
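
To avoid typos when editing the hard-coded directory, the name can be assembled from its parts. The sketch below only reproduces the pattern visible in the example directory name above (`<benchmark>_<experiment>_<key=value,...>/seed=<seed>/`); the helper `run_dir_suffix` is hypothetical, and which overrides actually end up in the folder name is decided by the repository's config, so always compare the result with the real folder under `runs`.

```python
def run_dir_suffix(experiment, overrides, seed, benchmark=None):
    """Build '<benchmark>_<experiment>_<k=v,...>/seed=<seed>/'.

    Hypothetical helper; the pattern is inferred from the example
    directory names in this README, not read from the training code.
    """
    joined = ",".join(f"{key}={value}" for key, value in overrides.items())
    prefix = f"{benchmark}_" if benchmark else ""
    return f"{prefix}{experiment}_{joined}/seed={seed}/"


if __name__ == "__main__":
    expert_dir = run_dir_suffix(
        experiment="CustomMultiTaskBenchmark",
        overrides={"env": "metaworld", "experiment": "CustomMultiTaskBenchmark"},
        seed=1,
        benchmark="metaworld",
    )
    print(expert_dir)
```

Passing `benchmark=None` drops the leading benchmark name for paths where the code adds it on its own.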

And to train the ACSF policy, again annoyingly, make sure the hard-coded model [path](https://gitlab-master.nvidia.com/mmozifian/pytorch_sac_sf/-/blob/master/train.py#L196) is correct, like `phi_w_dir += 'LearnPhiReachDoorRWall14_env=metaworld,experiment=LearnPhiReachDoorRWall14,representation.latent_size=14/seed=1/'`. Note that the code is already appending the benchmark name, so skip that for this one. The latent size must also match whatever size you used when training the latent features.

```
python train.py env=metaworld experiment=SFMetaworld representation.latent_size=14 phi_w_model_date=2021.11.13 seed=1
```

Note that this requires the command line argument `phi_w_model_date`, which is used to fetch the trained `phi` and `W` models.


### Regressing W only
Example command for running `single_goal` with one of Phi and W fixed while regressing the other. Note: this is currently disabled in the code, since it was intended as a sanity check and we are usually interested in regressing both Phi and W jointly.
```
python train_phi_w.py env=reacher_easy experiment=LearnWFixPhi learn_w=true learn_phi=false
```

### Regressing Phi only
```
python train_phi_w.py env=reacher_easy experiment=LearnPhiFixW learn_w=false learn_phi=true
```
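
For context, the `learn_w` and `learn_phi` flags can be read against the usual successor feature decomposition. The formulas below are the textbook formulation, stated here as an assumption rather than taken from this repository's code; in particular, the squared-error loss is only one common choice.

```math
\begin{aligned}
r(s, a, s') &\approx \phi(s, a, s')^\top w \\
\psi^\pi(s, a) &= \mathbb{E}^\pi\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \phi(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s,\ a_0 = a\right] \\
Q^\pi(s, a) &= \psi^\pi(s, a)^\top w \\
\mathcal{L}(\phi, w) &= \mathbb{E}\!\left[\left(r - \phi(s, a, s')^\top w\right)^2\right]
\end{aligned}
```

Under this reading, `learn_w=true learn_phi=false` minimizes the loss over $w$ with $\phi$ held fixed, `learn_w=false learn_phi=true` does the opposite, and the joint setting optimizes both.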

Note: The results above might assume certain modifications to the reward function of the task, which would be inside the custom versions of the [dm_control](https://gitlab.com/melfm/dm_control) & [dmc2gym](https://gitlab.com/melfm/dmc2gym) branches. Also, the initial pose of the arm and the target goal locations are fixed, and an environment id is required to determine the position of the goal. The [Metaworld](https://gitlab.com/melfm/metaworld) reward functions were kept the same (the only modification was to allow creating custom environment benchmarks for multi-task training).