# Advantage-Aware Policy Optimization for Offline Reinforcement Learning

This directory contains the implementation of our proposed Advantage-Aware Policy Optimization (A2PO).

## Requirements
- [PyTorch 2.0.0](https://github.com/pytorch/pytorch)
- [Python 3.8.10](https://www.python.org/downloads/release/python-3810/)
- [OpenAI gym 0.21.0](https://github.com/openai/gym)
- [mujoco-py 2.0.2.0](https://github.com/openai/mujoco-py)
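
To confirm that an environment matches the pinned versions above, a small check along these lines may help. This is a minimal sketch: the distribution names (`torch`, `gym`, `mujoco-py`) are assumed to match the names the packages were installed under, and `check_versions` is a hypothetical helper, not part of this repository.

```python
from importlib import metadata

# Versions pinned in the list above; distribution names are assumptions.
PINNED = {"torch": "2.0.0", "gym": "0.21.0", "mujoco-py": "2.0.2.0"}


def check_versions(pinned):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        # Local builds may carry a suffix such as "2.0.0+cu118";
        # compare only the part before the "+".
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches
```

Calling `check_versions(PINNED)` returns an empty dict when every pinned package is installed at the expected version. The Python version itself is not covered and can be checked separately with `python --version`.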

## Datasets
[D4RL datasets](https://github.com/rail-berkeley/d4rl) were used for evaluation in this paper, covering both locomotion and navigation tasks.

#### Locomotion tasks
For locomotion tasks, three tasks were used in this paper:
- _halfcheetah-v2_
- _hopper-v2_
- _walker2d-v2_

Each task has datasets collected by different policies:

- *random*
- *medium*
- *expert*
- *medium-replay*
- *medium-expert*
- *random-medium*
- *random-expert*
- *random-medium-expert*

The first 5 datasets are provided by [D4RL](https://github.com/rail-berkeley/d4rl), while the last 3 datasets are manually constructed.

Because the supplementary material is limited to 100MB and the manually constructed mixed-quality datasets are large, they are not included in this directory.
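
This README does not describe how the manually constructed datasets were built. As a rough illustration only, the sketch below assumes a mixed-quality dataset is formed by sampling the same number of transitions from each single-policy dataset and concatenating them. The function `mix_datasets`, the equal-size sampling, and the toy field names are all hypothetical and may differ from the procedure used in the paper.

```python
import random


def mix_datasets(datasets, n_per_source, seed=0):
    """Sample n_per_source transitions from each dataset and concatenate them.

    Each dataset is a dict mapping a field name (e.g. "observations") to a
    sequence with one entry per transition. All datasets must share the
    same keys. This is an assumed mixing rule, not the paper's procedure.
    """
    rng = random.Random(seed)
    keys = list(datasets[0].keys())
    mixed = {key: [] for key in keys}
    for data in datasets:
        size = len(data[keys[0]])
        # Sample without replacement, capped at the dataset size.
        indices = rng.sample(range(size), min(n_per_source, size))
        for key in keys:
            # The same indices are used for every field so that
            # transitions stay aligned across fields.
            mixed[key].extend(data[key][i] for i in indices)
    return mixed


# Toy stand-ins for a "random" and an "expert" dataset.
random_data = {"observations": [[0.0], [0.1], [0.2]], "rewards": [0.0, 0.0, 0.1]}
expert_data = {"observations": [[1.0], [1.1], [1.2]], "rewards": [1.0, 1.0, 0.9]}
random_expert = mix_datasets([random_data, expert_data], n_per_source=2)
```

The helper indexes row by row, so it also accepts the NumPy arrays that D4RL returns, although the result is then a dict of Python lists rather than arrays.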

#### Navigation tasks
For navigation tasks, six tasks whose datasets were generated by an **expert** behavior policy were used in this paper:
- _maze2d-umaze-v1_
- _maze2d-medium-v1_
- _maze2d-large-v1_
- _antmaze-umaze-v1_
- _antmaze-medium-diverse-v1_
- _antmaze-large-diverse-v1_



## Usage

The results in the paper can be reproduced with:
```
python main.py --env=<env_name> --seed=<seed_id>
```
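
To sweep the command above over several datasets and seeds, a small launcher can generate one invocation per combination. This is a sketch under assumptions: it supposes that `--env` accepts D4RL-style names such as `halfcheetah-medium-v2`, and the seed list and the helpers `build_commands` and `run_all` are hypothetical. Only the five dataset types shipped with D4RL are listed.

```python
import itertools
import subprocess
import sys

# Assumption: --env accepts D4RL-style names such as "halfcheetah-medium-v2".
TASKS = ["halfcheetah", "hopper", "walker2d"]
QUALITIES = ["random", "medium", "expert", "medium-replay", "medium-expert"]
SEEDS = [0, 1, 2]  # hypothetical seed ids


def build_commands(tasks, qualities, seeds):
    """Return one `python main.py ...` argument list per (task, quality, seed)."""
    commands = []
    for task, quality, seed in itertools.product(tasks, qualities, seeds):
        env_name = f"{task}-{quality}-v2"
        commands.append(
            [sys.executable, "main.py", f"--env={env_name}", f"--seed={seed}"]
        )
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it as well when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # Runs sequentially and raises on the first failing run.
            subprocess.run(cmd, check=True)
```

`run_all(build_commands(TASKS, QUALITIES, SEEDS))` prints the 45 commands without running them; pass `dry_run=False` from the repository root to execute them one after another.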

To examine the influence of individual components, the command can be extended as below:

```
python main.py --env=<env_name> --use_discrete=<Bool> --epsilon=<epsilon> --vae_step=<vae step> --seed=<seed_id>
```
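
How `main.py` parses these flags is not shown in this README. One detail worth knowing when passing `--use_discrete=<Bool>` is that `argparse` with `type=bool` treats any non-empty string, including `"False"`, as `True`. The sketch below shows one common way to handle this. The flag names follow the command above, while the types, the `None` defaults, and the helpers `str2bool` and `build_parser` are assumptions rather than the repository's actual code.

```python
import argparse


def str2bool(value):
    """Convert common textual booleans to bool for use as an argparse type."""
    lowered = value.lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def build_parser():
    # Flag names follow the command above; types and defaults are assumptions.
    parser = argparse.ArgumentParser()
    parser.add_argument("--env", type=str, required=True)
    parser.add_argument("--use_discrete", type=str2bool, default=None)
    parser.add_argument("--epsilon", type=float, default=None)
    parser.add_argument("--vae_step", type=int, default=None)
    parser.add_argument("--seed", type=int, default=None)
    return parser
```

With this parser, `--use_discrete=False` yields the boolean `False`, and an unrecognized value such as `--use_discrete=maybe` is rejected with a usage error instead of being silently read as `True`.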



