# Bayesian Query-Efficient Offline Preference-Based Reinforcement Learning via In-Dataset Exploration

## Usage
The results in the paper were obtained with [MuJoCo 1.50](http://www.mujoco.org/) (and [mujoco-py 1.50.1.1](https://github.com/openai/mujoco-py)) in [OpenAI Gym 0.17.0](https://github.com/openai/gym), using the [D4RL datasets](https://github.com/rail-berkeley/d4rl) and [Meta-World](https://github.com/Farama-Foundation/Metaworld). Networks are trained using [PyTorch 1.4.0](https://github.com/pytorch/pytorch) and Python 3.6.
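
Because the versions above are fairly old, it can help to confirm what is installed before launching a run. The following is a minimal sketch, not part of this repository: the helper names `installed_version` and `check_versions` are hypothetical, and the distribution names (`mujoco-py`, `gym`, `torch`) are assumed to be the PyPI names of the packages listed above.

```python
# Pinned versions copied from the paragraph above. The keys are assumed
# PyPI distribution names, not names defined by this repository.
EXPECTED = {"mujoco-py": "1.50.1.1", "gym": "0.17.0", "torch": "1.4.0"}


def installed_version(dist_name):
    """Return the installed version string, or None if it cannot be found."""
    try:
        # Available in the standard library from Python 3.8 onward.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for Python 3.6/3.7, where setuptools provides pkg_resources.
        try:
            import pkg_resources
        except ImportError:
            return None
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def check_versions(expected, lookup=installed_version):
    """Return {name: (expected, found)} for each missing or mismatched package."""
    mismatches = {}
    for name, want in expected.items():
        found = lookup(name)
        # Some wheels carry a local suffix such as "1.4.0+cu100"; ignore it.
        if found is None or found.split("+")[0] != want:
            mismatches[name] = (want, found)
    return mismatches


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

Running the script prints one line per package that is missing or differs from the pinned version, and prints nothing when everything matches.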

## Reward model training
```
CUDA_VISIBLE_DEVICES=0 python -m JaxPref.new_preference_reward_main --use_human_label False --transformer.embd_dim 256 --transformer.n_layer 1 --transformer.n_head 4 --logging.output_dir './logs/pref_reward' --batch_size 256 --num_query 20 --query_len 50 --n_epochs 10000 --skip_flag 0 --seed 0 --model_type PrefTransformer --config configs/adroit_config.py --env metaworld_push-v2
```

## Offline training
```
CUDA_VISIBLE_DEVICES=0 python train_offline_ensemble.py --seq_len 50 --eval_interval 5000 --config configs/adroit_config.py --eval_episodes 10 --use_reward_model True --env_name metaworld_push-v2
```