# APPO: Adversarial Preference-based Policy Optimization

The official implementation of \<Adversarial Policy Optimization for Preference-based Reinforcement Learning\>.

## Dependencies

Install the packages with the `environment.yml` file:
```
conda env create -f environment.yml
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld
```

To install the packages manually:
```
conda create -n appo python=3.8
conda activate appo
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install tensorboard ipykernel matplotlib seaborn
pip install "gym[mujoco_py,classic_control]==0.23.0"
pip install pyrallis tqdm
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld
```

## Datasets
The Meta-world `medium-replay` dataset is available in the official repository of [LiRE](https://github.com/chwoong/LiRE). The Meta-world `medium-expert` dataset was collected with the code provided in the official repository of [IPL](https://github.com/jhejna/inverse-preference-learning).


## Training
The hyperparameters are defined in the configuration files under `configs/`.
To change learning rates, network architectures, batch sizes, or other algorithmic hyperparameters, edit the corresponding config file.

To train the reward model on the dial-turn task:
```
python reward_learning/learn_reward.py --config=configs/dial-turn-v2/reward.yaml
```
To train APPO on the dial-turn task:
```
python appo.py --config=configs/dial-turn-v2/appo.yaml
```
To train MR on the dial-turn task:
```
python mr.py --config=configs/dial-turn-v2/mr.yaml
```

## Results
The training results are stored in `log/`.
All experiments were run with 5 random seeds each, and the learning curves are smoothed by exponential averaging with a factor of 0.5.
Plots are created with `plotter.ipynb`.
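
For readers who want to reproduce the smoothing outside the notebook, the following is a minimal sketch of exponential averaging followed by per-step aggregation over seeds. It is not taken from `plotter.ipynb`: the function names `ema_smooth` and `aggregate_seeds` are hypothetical, and the update rule `s_t = factor * s_{t-1} + (1 - factor) * x_t` (initialized with the first value, no bias correction) is an assumption about the convention used.

```python
from statistics import mean, stdev


def ema_smooth(values, factor=0.5):
    """Exponentially smooth a curve (assumed convention, see lead-in).

    s_0 = x_0 and s_t = factor * s_{t-1} + (1 - factor) * x_t.
    """
    smoothed = []
    prev = None
    for x in values:
        # The first point is kept as-is; later points blend with the history.
        prev = x if prev is None else factor * prev + (1.0 - factor) * x
        smoothed.append(prev)
    return smoothed


def aggregate_seeds(curves):
    """Return per-step mean and sample standard deviation across seeds.

    `curves` is a list of equally long lists, one per random seed.
    """
    means, stds = [], []
    for step_values in zip(*curves):
        means.append(mean(step_values))
        # stdev needs at least two samples; fall back to 0.0 for one seed.
        stds.append(stdev(step_values) if len(step_values) > 1 else 0.0)
    return means, stds


if __name__ == "__main__":
    # Toy success rates for two seeds (illustrative numbers, not real results).
    raw = [[0.0, 1.0, 1.0], [0.0, 0.0, 1.0]]
    smoothed = [ema_smooth(curve, factor=0.5) for curve in raw]
    means, stds = aggregate_seeds(smoothed)
    print(means)
```

Smoothing each seed first and then averaging gives the same mean curve as averaging first, because the filter is linear; the spread across seeds, however, is computed on the smoothed curves in this sketch.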


## Reference

Our code is based on the official implementation of \<Listwise Reward Estimation for Offline Preference-based Reinforcement Learning\> (Choi et al., 2024): [https://github.com/chwoong/LiRE](https://github.com/chwoong/LiRE)