# PVO: Preference-based Value Optimization

The official implementation of \<Offline Preference-based Value Optimization\>.

## Requirements

Due to package conflicts, we use separate Python environments for the DMControl and Meta-World experiments.

For Meta-World experiments, install the packages from the `env_metaworld.yml` file, then install Meta-World.
```
conda env create -f env_metaworld.yml
pip install git+https://github.com/Farama-Foundation/Metaworld.git@master#egg=metaworld@d2338dcaea40e7dc521f172d3afd05eb71f60498
```
For DMControl experiments, install the packages from the `env_dmc.yml` file.

## Datasets
The DMControl `medium-replay` and Meta-World `medium-replay` datasets are available in the official repository of [LiRE](https://github.com/chwoong/LiRE). The Meta-World `medium-expert` dataset was collected with the code provided in the official repository of [IPL](https://github.com/jhejna/inverse-preference-learning).


## Training
The configuration files are under `configs/`.

To train the reward model on the Meta-World `medium-replay` dial-turn dataset:
```
python reward_learning/learn_reward.py --config=configs/mw_medium-replay/dial-turn-v2/reward.yaml
```
To train PVO on the Meta-World `medium-replay` dial-turn dataset:
```
python pvo.py --config=configs/mw_medium-replay/dial-turn-v2/pvo.yaml
```
APPO and IQL are implemented in `appo.py` and `mr.py`, respectively.
TD3+BC and XQL are implemented in `td3bc.py` and `xql.py`; their variants with the value alignment loss are implemented in `td3bc_va.py` and `xql_va.py`.
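
The two training commands above share the same config path pattern. As a minimal sketch, assuming other datasets and tasks follow the same `configs/<dataset>/<task>/<name>.yaml` layout (only the dial-turn paths are confirmed here), the commands could be generated as follows. The helpers `config_path` and `train_command` are hypothetical and not part of this repository.

```python
from pathlib import PurePosixPath
from typing import List


def config_path(dataset: str, task: str, name: str) -> str:
    """Return configs/<dataset>/<task>/<name>.yaml.

    Assumption: every dataset/task pair uses the layout seen in the
    dial-turn example; check the contents of configs/ before relying on it.
    """
    return str(PurePosixPath("configs") / dataset / task / f"{name}.yaml")


def train_command(script: str, dataset: str, task: str, name: str) -> List[str]:
    """Build the argument list for one training run (hypothetical helper)."""
    return ["python", script, f"--config={config_path(dataset, task, name)}"]


if __name__ == "__main__":
    # Print the two commands documented above for the dial-turn task.
    runs = [("reward_learning/learn_reward.py", "reward"), ("pvo.py", "pvo")]
    for script, name in runs:
        print(" ".join(train_command(script, "mw_medium-replay", "dial-turn-v2", name)))
```

The returned list can be passed to `subprocess.run(..., check=True)` to launch a run from a script.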

## Reference

Our code is based on the official implementations of \<Listwise Reward Estimation for Offline Preference-based Reinforcement Learning\> (Choi et al., 2024): [https://github.com/chwoong/LiRE](https://github.com/chwoong/LiRE) and \<Adversarial Policy Optimization for Offline Preference-based Reinforcement Learning\> (Kang and Oh, 2025): [https://github.com/oh-lab/APPO](https://github.com/oh-lab/APPO).